Preface


This book is intended as a text for Mathematics students taking their first course in 
Statistics, and grew out of my second-year course for mathematics undergraduates at 
EPFL. It is a book on “Statistics for Mathematicians” rather than on “Mathematical
Statistics”: the intent is not to focus on the deeper mathematical/theoretical aspects 
of the subject but rather to provide an introduction to the basic notions tailored 
to the mindset and tastes of the Mathematics student. Mathematics students are 
sometimes put off by the informal nature of first courses in Statistics, since many 
results are usually stated without proof or are accompanied by heuristic sketches 
of proofs. Another risk may be that of “intellectual entropy”, when too many (and 
diverse) topics are covered in a single course, risking the impression of Statistics 
as a collection of recipes lacking natural connection. This book can be used as a 
basis for an elementary semester-long first course in Statistics that presents the basic 
ideas of one-parameter inference in a coherent manner, while making essentially no 
sacrifices on matters of rigour. It is meant to be compact enough to be realistically
covered in full during a single semester, and yet to attract mathematics
students to pursue further elective courses in Statistics. In more detail, the three
main tasks this text sets out to address are as follows.


(1) To provide a rigorous yet elementary course. The effort is to prove essentially
all the results rigorously. These results include some of the most central results such 
as the asymptotics of maximum likelihood, optimality in testing, asymptotics of 
likelihood ratio tests, and optimality results regarding confidence intervals. It also 
contains detailed proofs of some elementary results that are rarely worked out in 
detail in elementary texts (for instance, the derivation of the distribution of the $t$
statistic). The only results not proven in the main text are some background results 
in probability and analysis. In the case of the probabilistic results, detailed proofs are 
in fact given in the appendix, and the proofs are still at an elementary level. These 
include results such as the continuous mapping theorem, Slutsky’s theorem, the 
(third moment) central limit theorem, and results pertaining to moment generating 
functions. The analytic results not proven are Taylor’s formula and the univariate 
inverse function theorem. These are stated in the appendix, where precise references 
are also provided for their proofs. In principle, thus, the course only requires 
students to have taken a first course in $\varepsilon/\delta$-level analysis (including sequences,
convergence, series, multivariable differentiation and Riemann integration, and Taylor’s
formula) and a first course in probability (including basic operations on events and
the corresponding probability calculus, discrete and continuous random variables, 
joint/conditional/marginal distributions, and expectation/variance/covariance). A 
succinct fact sheet on all the probabilistic prerequisites is provided in the appendix, 
for easy reference. 


(2) To provide a conceptually compact course, with a firm sense of direction 
The entire book can realistically be covered in full during the course of a semester, 
and it is also realistic for the students to solve all the exercises during the same 
period of study (a solution manual is available upon request for instructors). I have
kept the number of topics to the minimum that can be covered during a semester
course without compromising on
the mathematics, while still providing an overview of the main ideas of statistical
inference. The course covers the basics of exponential families, exploratory data
analysis, sampling, estimation, testing, and confidence intervals. It is true that the
book does not tell the whole story and avoids detailed discussions of all the 
possible complications and variants in each section. However, I believe that the 
topics covered give a firm basis for the students to build on, and every attempt has 
been made for the story it tells to flow naturally, without giving the appearance 
of a collection of techniques. There is extensive cross-referencing of the material, 
illustrating how the different results are tied together, and an effort to develop the 
material in a “linear fashion”, explaining why one is doing whatever one is doing
at every point, and what the ultimate purpose is. No result is mentioned in vain (any
results presented are subsequently used), and results are accompanied by substantial
motivation and discussion. References to results are always accompanied by
the number of the result, along with its page number in the book, which allows
for easy reference and self-study.


(3) To provide a course that is not on “Mathematical Statistics” but rather
on “Statistics for Mathematicians”. The audience is primarily intended to be
undergraduate mathematicians, whom I hope to attract into Statistics, rather than
statisticians to whom I might want to introduce the more mathematical aspects of
Statistics. Therefore, the course is not primarily intended to be a course in statistical
theory. Rather, it is intended to be an entry-level course in statistical inference,
presented in a way that would be better received by an audience composed of
mathematicians. Accordingly, the discussion of different topics and the style and
considerations are adapted to such an audience. For example, optimality, whenever 
discussed, is not presented as an end in itself but rather as a means of motivating 
methodology (the idea being that mathematicians would be motivated by “best” 
results more than by heuristics). 

The means of balancing the requirements of an elementary yet rigorous text was
to adopt the use of the exponential family of distributions throughout (rather than 
aiming for full generality). This is of course a restriction, but in some ways not 
a major one, since most of the examples treated in elementary textbooks are, in 
fact, exponential families. Focusing on exponential families not only allows for 
elementary proofs using basic analysis and probability but also allows for the
statements of the theorems and the required conditions to be simple and intuitive.
Whenever results do hold more generally, this is remarked as a side note. A more 
detailed description of the structure of the text, and the progression of topics, can be 
found in the “Brief Overview” Section (p. 1). 

The main concessions that regrettably had to be made in terms of coverage 
pertain to regression and the Bayesian paradigm, and this deserves an apology. The 
textbook is based on the first Statistics course that mathematics students take at 
EPFL, but this course is also the only compulsory course in Statistics. It may thus 
well be their last (though hopefully the book will convince them otherwise). In this 
case there is a dilemma. Does one strive to include as many topics as possible, so 
that the student is well equipped in the future in case this is all the Statistics they
will ever see? Or does one try to cover a minimally sufficient number of topics
as clearly and completely as possible, hoping that at least these topics will stick
in the mind? I opted for the second approach, as my impression is that adding more
topics does not guarantee that these topics will in fact be remembered (in fact, a 
student having only taken a single Statistics course and finding themselves needing 
Statistics later will almost certainly have to do further reading anyway) and because 
this approach is more in line with the effort to produce a course with low conceptual 
entropy. For instance, notions such as p-values and confidence intervals are quite 
subtle to understand upon a first encounter (avoiding flawed interpretations such as 
“the probability that $H_0$ is valid” or “the probability that the parameter falls in the
interval is 95%”). When the student does not already have a solid grasp of these notions, it may be
unsettling, or worse still confusing, to suddenly switch to a different paradigm.

In writing the book, and preparing examples and exercises, I have drawn inspiration
from many excellent textbooks that have stood the test of time (but also from more
recent online resources, including Wikipedia and mathstackexchange). In doing this, 
I tried to balance the rigour found in advanced textbooks focusing on Mathematical 
Statistics, with the more accessible style of entry-level textbooks focusing on the 
basics of statistical inference. The former category includes Lehmann and Casella 
[15], Lehmann and Romano [16], Cox and Hinkley [6], Bickel and Doksum [1], 
Schervish [22], Shao [23], and Young and Smith [26], and the latter category 
includes Rice [19], Hogg and Tanis [13], Hogg and Craig [12], and Silvey [24] 
(the last one perhaps bordering on the first category). The book by Knight [14]
strikes a very nice balance between the two objectives, though still at a level higher 
than the present text aims, and has also been an important source of inspiration 
and exercises/examples. More texts striking a good balance and including a more 
comprehensive list of topics than the present one (but still not including several 
proofs) include Casella and Berger [4], Davison [9], and Wasserman [25]. The 
necessary probability background for the present text is covered quite nicely in the 
first three chapters of Knight [14], but of course there are several texts devoted 
specifically to elementary probability (i.e. non-measure theoretic probability) that 
would suffice (e.g. Blitzstein and Hwang [3], Dalang and Conus [8] (in French), 
Grimmett and Welsh [11], Pitman [18], and Ross [20]). As mentioned earlier, 
Sect. A.1 contains a quick overview of the main prerequisites, for ease of reference.

While the main audience for the book will be instructors and students in
mathematics undergraduate programmes, the textbook could also be used in other
programmes of study with substantial mathematical content, for instance by students
of physics, economics, computer science, and engineering looking
for a more formal coverage of one-parameter inference. After all, to think like a
mathematician is to think rigorously, regardless of the subject matter at hand. 

In closing, I would like to express my gratitude to my PhD students 
and my undergraduate students whose meticulous comments and suggestions 
helped improve earlier drafts. Marie-Hélène Descary, Mikael Kuusela, Valentina
Masarotto, Matthieu Simeoni, and Yoav Zemel provided extensive feedback, 
suggestions on exercises, and help with proofreading and layout. I especially 
enjoyed chatting with Yoav Zemel about how to best tiptoe around measure theory 
in the proofs of some more delicate results in the appendix (while remaining fully 
rigorous). I am also very thankful to two anonymous reviewers, who read a first 
version of the book and gave constructive and encouraging feedback. Any remaining 
glitches are, of course, my own. Finally, I would like to thank Veronika Rosteck and 
Springer/Birkhäuser for our pleasant collaboration.


Lausanne, Switzerland Victor M. Panaretos 
October 2015

Brief Overview


In a general sense, one can describe Statistics as the mathematical discipline whose 
purpose is to 

use empirical data generated by a random phenomenon, in order to
make inferences about some deterministic characteristics of the phenomenon,
while simultaneously quantifying the uncertainty inherent in these inferences.

Let’s take a step back and consider the different elements of this description. What
is a random phenomenon? We can think of a random phenomenon as a system 
or process whose outcome X is uncertain. This means that, even if we know 
every aspect of this system or process, we cannot perfectly predict its outcome X. 
Mathematically, such phenomena are formalised via the theory of probability: the
outcome $X$ is a random variable, and the model that describes the phenomenon is the
probability distribution function $F(x) = P[X \le x]$ of this random variable. Now
there may be a characteristic $\theta$ of this phenomenon that influences the probabilities
associated with the outcome of $X$. Such a characteristic is called a parameter. Since
the probability of $\{X \le x\}$ is influenced by $\theta$, the function $F(x)$ must be a function
of $\theta$, so we write it as $F(x; \theta) = P_\theta[X \le x]$.

If we know the functional form of $F(x; \theta)$, and the true value of $\theta$, we can
then calculate the probability $P_\theta[X \le x] = F(x; \theta)$ for any possible outcome
$x$. Statistics deals with the inverse problem: suppose that we know the precise
functional form of $F(x; \theta)$, but do not know which is the true $\theta$. If we have an
outcome $x$ (a realisation of $X$), is it possible to say something useful about $\theta$?
It seems that we should be able to do so. Since $\theta$ influences what outcomes are
most probable, knowing an outcome should give us information on which $\theta$
are plausible. The topic of this text will be how exactly to make this connection
rigorous and show how to exploit it in order to (a) make the best possible use of our
data $x$ to better inform ourselves about $\theta$ and (b) understand how certain we can be
about our inferences on $\theta$ for the given data $x$. In summary, our framework is as
follows:


1. There is a distribution $F(x; \theta)$ depending on an unknown $\theta \in \mathbb{R}^p$.
2. We observe the realisation of $n$ independent identically distributed random
variables $X_1, \ldots, X_n$ that follow this distribution.
3. We wish to use our $n$ observations (the realisations of $X_1, \ldots, X_n$) in order
to make statements about the true value of $\theta$ and to quantify the uncertainty
associated with those statements.
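
The idea that an observed outcome carries information about which $\theta$ are plausible can be illustrated with a minimal Python sketch. It assumes the simplest two-outcome (coin-flip) model, in which each observation equals 1 with probability $\theta$ and 0 otherwise. The function name `sample_probability` and the data are hypothetical, chosen only for illustration.

```python
from math import prod

def sample_probability(xs, theta):
    """P_theta[X_1 = x_1, ..., X_n = x_n] for iid 0/1 outcomes,
    each equal to 1 with probability theta (assumed model)."""
    return prod(theta if x == 1 else 1.0 - theta for x in xs)

# Hypothetical data: n = 10 observations, 7 of which equal 1.
observed = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

# The same sample is far more probable under some values of theta than others.
for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    prob = sample_probability(observed, theta)
    print(f"theta = {theta:.1f}: probability of the observed sample = {prob:.3e}")
```

Values of $\theta$ under which the observed sample is very improbable are, informally, less plausible. Making this heuristic rigorous is the subject of the chapters that follow.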


At first glance, this framework may seem restrictive. Indeed, it represents a 
significant simplification over the much broader framework where one can develop 
statistical methodology. For example, in general, the unknown parameter of interest
$\theta$ might not be an element of $\mathbb{R}^p$, but an element of a more general mathematical
space (a space of functions, for instance). Also the data $(X_1, \ldots, X_n)$ could be
dependent; they could themselves be vectors, or functions, or some more general 
mathematical object. 

However, some of the key ideas that statisticians employ in order to attack 
these more general situations are already present in the simpler scenario that we 
will consider in this text. In fact, many far more complex situations can often
be reduced to this simpler case by a careful use of mathematics (for example, a
real function can be identified with a vector in $\mathbb{R}^p$ when represented by its basis
coefficients in some basis expansion, a dependent collection of random variables 
might in fact be approximately independent, and so on). In a sense, the framework 
we will consider here is the simplest non-trivial case that nevertheless contains the 
germs of generality. 


Following is an overview of the contents of this text: 

1. In Chap. 1, we will review the different types of probability models that we will 
construct statistical methods for. We will try to understand what situations they 
are suitable for, and what are some of their key properties. We will also try to find 
a unifying framework in which we can describe several of these models at once: 
instead of developing results separately for each model, we will try to give an 
abstract description of some key common characteristics that will be useful for 
obtaining general results. At the end of the chapter, we will consider the problem 
of how to choose a type of model, whether by first principles or by means of 
exploratory data analysis using numerical and graphical summaries. 

2. In Chap. 2, we will develop the relevant concepts and probabilistic results that 
are needed in order to study the problem of sampling from probability models. 
We will probe the behaviour of the random sample, and how this relates to the 
original model, and what aspects of a sample are important for the purposes 
of statistical inference. An important focus will be to describe the probabilistic 
behaviour of functions of a sample. That is, given a sample $X_1, \ldots, X_n$ from a
distribution $F$, what is the distribution of $g(X_1, \ldots, X_n)$ for some function $g$?
The reason we will do this is simple: all that we have available to do statistics is 
the sample, so anything we do will be a function of the sample! 

3. Once we know what probability models we wish to consider, and how to handle 
samples from probability models, we will turn to the most basic statistical 
inference question one can ask: given a sample $X_1, \ldots, X_n$ from a distribution
$F$ that depends on an unknown parameter $\theta$, construct an estimator: a function
of the sample whose purpose is to estimate $\theta$. We will consider how to formalise
the quality of such an estimator in terms of quantifying its accuracy, and what
methods there are for constructing good estimators (for example, are there optimal
methods?).

4. Chapter 4 deals with a somewhat different problem. Instead of trying to guess 
which $\theta$ was the one that generated the observed sample $X_1, \ldots, X_n$, we will
attempt to answer the following question: given a candidate value $\theta_0$ for $\theta$ (or
some candidate values forming a set $\Theta_0$), decide on the basis of the sample
$X_1, \ldots, X_n$ whether this value (or set of values) is a good guess for the true $\theta$. An
important part of the chapter will be devoted to making formal what we mean by 
candidate values, good guesses (and bad guesses), and whether there are optimal 
strategies to do so. We will also be considering how to quantify the accuracy of 
our decisions. 

5. Finally, in Chap. 5, we will deal with the third of the basic trio of problems of
statistical inference: confidence intervals. Roughly speaking, instead of trying
to estimate the precise value of $\theta$ that generated our sample $X_1, \ldots, X_n$, we
wish to provide a whole range of values in the form of some interval, which will
very likely contain the true parameter $\theta$. This chapter will formalise this notion
and consider how we can construct “small” regions that have high probability
of covering the true parameter $\theta$. We will, in fact, see that the problem of
constructing confidence intervals is very closely connected both with the problem 
of point estimation and with the problem of hypothesis testing. 


Regular Probability Models 


Before setting out to explore how we can use statistics in order to learn about 
the structure of probability models given data from these models, we must first 
specify what types of probability models we shall consider (and some of their 
basic properties). For the purposes of this course, a probability model will be the
distribution $F$ of a random variable $X$ which takes values in some subset of the real
line $\mathbb{R}$:

$$F(x) = P[X \le x], \qquad x \in \mathbb{R}.$$

We write $X \sim F$ to state that $F$ is the distribution of $X$. If $\{X_i\}_{i \in I}$ is a collection of
independent identically distributed random variables with distribution $F$, we write
$X_i \overset{iid}{\sim} F$. The distribution $F$ will typically depend on one or several parameters
that we shall represent as $\theta = (\theta_1, \ldots, \theta_p)^\top \in \Theta \subseteq \mathbb{R}^p$ (depending on the
context, a different Greek letter or a Latin letter may be used). The space $\Theta$
where the parameter $\theta$ belongs is called the parameter space. To indicate that the
distribution $F$ depends on the parameter $\theta$, we will often write $F_\theta$ or $F(x; \theta)$. All
of the examples we will see and most of the theory we will develop will pertain to 
probability models that we shall call regular. 


Definition 1.1 (Regular Parametric Probability Models) 
Let $X$ be a real-valued random variable, and let $F_\theta$ be its distribution function,
for $\theta$ a parameter with parameter space $\Theta \subseteq \mathbb{R}^p$. The probability model
$\{F_\theta : \theta \in \Theta\}$ will be called regular if one of the two following conditions holds:

1. For all $\theta \in \Theta$, the distribution $F_\theta$ is continuous with density $f(x; \theta)$.

2. For all $\theta \in \Theta$, the distribution $F_\theta$ is discrete with probability mass function
$f(x; \theta)$ such that $\sum_{x \in \mathbb{Z}} f(x; \theta) = 1$ for all $\theta \in \Theta$.


Simply put, the model $F_\theta$ cannot switch between continuous and discrete
depending on the value of $\theta$. And, if it is discrete, the set of its possible values will always be taken
to be a subset of the integers (e.g. it cannot be $\mathbb{Z} + \theta$, where $\theta \in [0, 1]$). The set
$\mathcal{X} := \{x \in \mathbb{R} : f(x; \theta) > 0\}$ will be called the sample space of $X$ (note that
$\mathcal{X}$ could depend on $\theta$, but it will always satisfy $\mathcal{X} \subseteq \mathbb{R}$ in the continuous case, or
$\mathcal{X} \subseteq \mathbb{Z}$ in the discrete case).

We will now review several regular probability models and their basic characteristics,
explain what situations they are appropriate as models for, and give some
illustrative examples. 


Remark 1.2 (Notation $P_\theta$ and $E_\theta$) When $F$ depends on a parameter $\theta$, we still
have

$$F(x; \theta) = P[X \le x].$$

Since the left-hand side depends on $\theta$, the right-hand side also must depend on $\theta$,
even though this is not explicit in our notation. Sometimes we will need to make that
clear, in which case we will write $P_\theta$ instead of just $P$ in order to remind ourselves
of this dependence. Similarly, we will sometimes write $E_\theta$ instead of just $E$ for the
expectation of $X$ when its distribution is $F(x; \theta)$.


1.1 Discrete Regular Models 


Perhaps the simplest imaginable probability model is the Bernoulli distribution. 
This models a situation where there are only two possible outcomes, often termed 
“success” and “failure”. The prototypical example is that of flipping a coin, where 
success (say heads) has probability $p$ and failure (tails) has probability $1 - p$.


Definition 1.3 (Bernoulli Distribution) 
A random variable $X$ is said to follow the Bernoulli distribution with parameter
$p \in (0, 1)$, denoted $X \sim \mathrm{Bern}(p)$, if

1. $\mathcal{X} = \{0, 1\}$,

2. $f(x; p) = p\,\mathbf{1}\{x = 1\} + (1 - p)\,\mathbf{1}\{x = 0\}$.

The mean, variance and moment generating function of $X \sim \mathrm{Bern}(p)$ are
given by

$$E[X] = p, \qquad \mathrm{Var}[X] = p(1 - p), \qquad M(t) = 1 - p + p e^t.$$
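
As a quick sanity check of these formulas, the following Python sketch computes the mean, variance and moment generating function directly from the probability mass function, by summing over the sample space $\{0, 1\}$. The helper names and the values of `p` and `t` are illustrative assumptions, not part of the text.

```python
from math import exp

def bernoulli_pmf(x, p):
    """f(x; p) = p 1{x = 1} + (1 - p) 1{x = 0}."""
    return p * (x == 1) + (1.0 - p) * (x == 0)

def expectation(g, p):
    """E[g(X)] for X ~ Bern(p), summing g(x) f(x; p) over the sample space {0, 1}."""
    return sum(g(x) * bernoulli_pmf(x, p) for x in (0, 1))

p, t = 0.3, 0.5  # illustrative values

mean = expectation(lambda x: x, p)
variance = expectation(lambda x: (x - mean) ** 2, p)
mgf = expectation(lambda x: exp(t * x), p)

print(f"E[X]   = {mean:.6f}  vs  p              = {p:.6f}")
print(f"Var[X] = {variance:.6f}  vs  p(1 - p)       = {p * (1 - p):.6f}")
print(f"M(t)   = {mgf:.6f}  vs  1 - p + p e^t  = {1 - p + p * exp(t):.6f}")
```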


Example 1.4 


Almost any random phenomenon whose outcomes may be classified in one of two categories can 
be modelled via the Bernoulli distribution. We simply name one category as success and the other 
as failure (success is usually the case we are most interested in). 

1. Sample a voter from some large electorate (so large that we take it to be countably infinite) right 
after the ballots have closed, and let $X$ be the vote she cast in the referendum. Then $X = 1$
(yes) with probability $p$ and $X = 0$ (no) with probability $1 - p$, where $p$ is the proportion of
voters who voted yes.

2. Consider a sonogram that is made with the purpose of determining the sex of a foetus. The
outcome $X$ can either be $X = 1$ (girl) or $X = 0$ (boy), with some probabilities $p$ and $1 - p$,
respectively. The value of $p$ in this case is determined by many and diverse environmental
factors, but in general can be considered to be constant within homogeneous populations.

3. Consider a quantum measurement on the spin of an electron in a particle system. The outcome
can either be 1 (spin up) or 0 (spin down) with probabilities $p$ and $1 - p$. The value of the
parameter here depends on the particular physical properties of the system.

4. Consider the barometric pressure in the lake Geneva region on a typical summer day. This might
be high (if above a certain threshold) or low (otherwise), and these two outcomes may be coded
as 1 and 0, respectively. Their corresponding probabilities, $p$ and $1 - p$, are determined by
several environmental factors.

5. More generally, we may create a Bernoulli random variable $Y$ from any other random variable
$X$ in the following way. Let $A \subseteq \mathcal{X}$ be some event in the sample space of $X$, and define
$Y = \mathbf{1}\{X \in A\}$. Then $Y$ has a Bernoulli distribution with $p = P[X \in A]$. Here, we interpret
success as the realisation of $X$ lying in $A$.


□


More often than not, we have several independent repetitions of an experiment 
with two possible outcomes, say “success” and “failure”, and we wish to model the
total number of successes. If the individual experiments are modelled as Bernoulli
experiments, then we are inevitably led to the binomial distribution. This models
the total number of heads in a sequence of $n$ independent coin flips.


Definition 1.5 (Binomial Distribution) 
A random variable $X$ is said to follow the binomial distribution with parameters
$p \in (0, 1)$ and $n \in \mathbb{N}$, denoted $X \sim \mathrm{Binom}(n, p)$, if

1. $\mathcal{X} = \{0, 1, 2, \ldots, n\}$,

2. $f(x; n, p) = \binom{n}{x} p^x (1 - p)^{n - x}$.

The mean, variance and moment generating function of $X \sim \mathrm{Binom}(n, p)$ are
given by

$$E[X] = np, \qquad \mathrm{Var}[X] = np(1 - p), \qquad M(t) = (1 - p + p e^t)^n.$$


Exercise 1 Show that if $X = \sum_{i=1}^n Y_i$ where $Y_i \overset{iid}{\sim} \mathrm{Bern}(p)$, then
$X \sim \mathrm{Binom}(n, p)$.
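
The claim of Exercise 1 can be checked numerically (this is not a proof) with the following Python sketch. It relies on the standard fact that the probability mass function of a sum of independent integer-valued random variables is the convolution of their individual probability mass functions. The helper names and the values of `n` and `p` are illustrative assumptions.

```python
from math import comb

def convolve(f, g):
    """PMF of the sum of two independent variables on {0, 1, 2, ...},
    given their PMFs as lists indexed from 0."""
    h = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

def binomial_pmf(x, n, p):
    """f(x; n, p) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)

n, p = 6, 0.35            # illustrative values
bern = [1.0 - p, p]       # PMF of Bern(p) on {0, 1}

pmf_sum = [1.0]           # PMF of the empty sum, which is identically 0
for _ in range(n):
    pmf_sum = convolve(pmf_sum, bern)

max_gap = max(abs(pmf_sum[x] - binomial_pmf(x, n, p)) for x in range(n + 1))
print(f"largest discrepancy between the two PMFs: {max_gap:.2e}")
```

The discrepancy should be of the order of floating-point rounding error.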


Example 1.6 


Since the binomial is a sum of independent Bernoulli random variables, we can expect that our
previous examples can be extended to give us examples of using the binomial distribution (though
this is not the case with all of them: we need both independence and equal probabilities of success
for a binomial distribution to be induced).

1. Sample $n$ voters from the same infinite electorate right after the ballots have closed, and let $Y$
be the number of voters in that sample who voted “yes”. Then $Y$ is binomial with $n$ trials and
success probability $p$, where $p$ is the proportion of voters who voted yes.

2. Consider a particle system with the property that the spin of individual particles is independent
of all others. If there are $n$ particles, then the number $Y$ of spin up particles is binomially
distributed with parameters $n$ and $p$, where $p$ is as before, and is related to the electromagnetic
properties of the system.

3. Consider again the barometric pressure in the lake Geneva region on a typical summer day, 
which can be high or low, with corresponding probabilities $p$ and $1 - p$. Let $Y$ be the number
of days with high barometric pressure within a period of n consecutive days. Though Y is a 
sum of Bernoulli variables, it is not a Binom(n, p). The reason is that the pressure conditions 
are dependent between consecutive days (hence the Bernoulli trials are not independent). 

4. Going back to the sonogram example, suppose that the probability of a given foetus being of 
female sex is $p$. Consider now a sonogram whose purpose is that of determining the number
of foetuses of female sex among two foetuses being gestated by the same woman (twins). The
outcome $Y$ can be 0, 1 or 2. If we know that the twins are non-identical (say this
is an event called $A$), then:

$$P[Y = 0 \mid A] = (1 - p)^2, \qquad P[Y = 1 \mid A] = 2p(1 - p), \qquad P[Y = 2 \mid A] = p^2.$$

In other words, given that the twins are non-identical

$$P[Y = y \mid A] = \binom{2}{y} p^y (1 - p)^{2 - y}, \qquad y = 0, 1, 2,$$

and so $Y$ is indeed binomial given $A$. However, if we do not know whether the twins are
non-identical, we must factor in the possibility that the twins might be identical. In this case:

$$\begin{aligned}
P[Y = y] &= P[Y = y \mid A]\,P[A] + P[Y = y \mid A^c]\,P[A^c] \\
&= \binom{2}{y} p^y (1 - p)^{2 - y}\,P[A] + \big(p\,\mathbf{1}\{y = 2\} + (1 - p)\,\mathbf{1}\{y = 0\}\big)\,P[A^c].
\end{aligned}$$

If $P[A^c] \neq 0$, this expression will in general not be expressible as a binomial probability mass
function, and so Y may not be binomial. This example highlights that dependence between trials 
may be subtly disguised, and that one must think carefully about the nature of the probability 
experiment before proceeding with a specific model. 


□
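
The twins computation in Example 1.6 can be made concrete with a short Python sketch. It evaluates the mixture probability mass function and compares it with the only binomial candidate $\mathrm{Binom}(2, q)$ that has the same mean, namely $q = E[Y]/2$. The numerical values of `p` and `prob_A` are hypothetical and chosen purely for illustration.

```python
from math import comb

def twins_pmf(y, p, prob_A):
    """P[Y = y]: number of girls among twins, mixing over A = {twins are non-identical}."""
    given_A = comb(2, y) * p**y * (1.0 - p) ** (2 - y)   # two independent trials
    given_Ac = p * (y == 2) + (1.0 - p) * (y == 0)       # identical twins share the same sex
    return given_A * prob_A + given_Ac * (1.0 - prob_A)

p, prob_A = 0.5, 0.7  # hypothetical values, for illustration only

pmf = [twins_pmf(y, p, prob_A) for y in range(3)]

# A Binom(2, q) variable has mean 2q, so matching means forces q = E[Y] / 2.
q = sum(y * f for y, f in enumerate(pmf)) / 2
binom_q = [comb(2, y) * q**y * (1.0 - q) ** (2 - y) for y in range(3)]

gap = max(abs(a - b) for a, b in zip(pmf, binom_q))
print("mixture PMF :", [round(f, 4) for f in pmf])
print("Binom(2, q) :", [round(f, 4) for f in binom_q], "with q =", round(q, 4))
print("largest gap :", round(gap, 4))
```

Since the mean-matched binomial does not reproduce the mixture, no binomial distribution with two trials does, for these values.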


Suppose now that we start a sequence of independent Bernoulli trials, say coin 
flips, and we continue flipping the coin until the first time we get heads (success). 
The number of tails (failures) until the first appearance of heads (the first success)
has the geometric distribution. 


Definition 1.7 (Geometric Distribution) 
A random variable $X$ is said to follow the geometric distribution with parameter
$p \in (0, 1)$, denoted $X \sim \mathrm{Geom}(p)$, if

1. $\mathcal{X} = \{0\} \cup \mathbb{N}$,

2. $f(x; p) = (1 - p)^x p$.

The mean, variance and moment generating function of $X \sim \mathrm{Geom}(p)$ are given
by

$$E[X] = \frac{1 - p}{p}, \qquad \mathrm{Var}[X] = \frac{1 - p}{p^2}, \qquad M(t) = \frac{p}{1 - (1 - p)e^t},$$

for $t < -\log(1 - p)$.

Fig. 1.1 Binomial probability mass functions for different values of the parameters n and p


Exercise 2 Let $\{Y_i\}_{i \ge 1}$ be an infinite collection of random variables, where
$Y_i \overset{iid}{\sim} \mathrm{Bern}(p)$. Let $T = \min\{k \in \mathbb{N} : Y_k = 1\} - 1$. Then $T \sim \mathrm{Geom}(p)$ (Figs. 1.1
and 1.2).
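
The closed-form mean, variance and moment generating function of the geometric distribution can be checked numerically with the Python sketch below. It sums over a truncated sample space $\{0, \ldots, K - 1\}$, which is an approximation whose error is negligible here because the tail decays geometrically. The truncation point `K` and the values of `p` and `t` are illustrative assumptions.

```python
from math import exp, log

def geometric_pmf(x, p):
    """f(x; p) = (1 - p)^x p: number of failures before the first success."""
    return (1.0 - p) ** x * p

p = 0.3
t = 0.5 * (-log(1.0 - p))   # any t < -log(1 - p) would do
K = 500                     # truncation point (assumption: tail beyond K is negligible)

xs = range(K)
mean = sum(x * geometric_pmf(x, p) for x in xs)
variance = sum((x - mean) ** 2 * geometric_pmf(x, p) for x in xs)
mgf = sum(exp(t * x) * geometric_pmf(x, p) for x in xs)

print(f"E[X]   ~ {mean:.8f}  vs  (1 - p)/p           = {(1 - p) / p:.8f}")
print(f"Var[X] ~ {variance:.8f}  vs  (1 - p)/p^2         = {(1 - p) / p**2:.8f}")
print(f"M(t)   ~ {mgf:.8f}  vs  p/(1 - (1 - p)e^t)  = {p / (1 - (1 - p) * exp(t)):.8f}")
```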


What about the distribution of the number of failures until the rth success in a 
sequence of Bernoulli trials? This follows the negative binomial distribution (also 
known as the Pólya distribution).

Fig. 1.2 Geometric probability mass functions for different values of the parameter p


Definition 1.8 (Negative Binomial Distribution) 
A random variable $X$ is said to follow the negative binomial distribution with
parameters $p \in (0, 1)$ and $r > 0$, denoted $X \sim \mathrm{NegBin}(r, p)$, if

1. $\mathcal{X} = \{0\} \cup \mathbb{N}$,

2. $f(x; p, r) = \binom{x + r - 1}{x} (1 - p)^x p^r$.

The mean, variance and moment generating function of $X \sim \mathrm{NegBin}(r, p)$ are
given by

$$E[X] = r\,\frac{1 - p}{p}, \qquad \mathrm{Var}[X] = r\,\frac{1 - p}{p^2}, \qquad M(t) = \frac{p^r}{[1 - (1 - p)e^t]^r},$$

for $t < -\log(1 - p)$.


Exercise 3 Show that if $X = \sum_{i=1}^r Y_i$ where $Y_i \overset{iid}{\sim} \mathrm{Geom}(p)$, then
$X \sim \mathrm{NegBin}(r, p)$. Deduce the mean, variance and moment generating function of $X$.
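
As a numerical companion to Exercise 3 (again, not a proof), the Python sketch below convolves $r$ geometric probability mass functions and compares the result with the negative binomial formula. It assumes an integer $r$, since `math.comb` requires integers, and works on the truncated range $\{0, \ldots, K - 1\}$. The convolution is exact on that range because all variables are non-negative. The names and parameter values are illustrative.

```python
from math import comb

def geometric_pmf(x, p):
    """f(x; p) = (1 - p)^x p."""
    return (1.0 - p) ** x * p

def negbin_pmf(x, r, p):
    """f(x; p, r) = C(x + r - 1, x) (1 - p)^x p^r, for integer r >= 1."""
    return comb(x + r - 1, x) * (1.0 - p) ** x * p**r

def truncated_convolve(f, g):
    """Values at 0, ..., K - 1 of the PMF of a sum of two independent
    non-negative integer variables, from their PMFs on the same range."""
    K = len(f)
    return [sum(f[i] * g[k - i] for i in range(k + 1)) for k in range(K)]

r, p, K = 4, 0.4, 30   # illustrative values

geom = [geometric_pmf(x, p) for x in range(K)]
pmf_sum = geom
for _ in range(r - 1):
    pmf_sum = truncated_convolve(pmf_sum, geom)

max_gap = max(abs(pmf_sum[x] - negbin_pmf(x, r, p)) for x in range(K))
print(f"largest discrepancy on 0..{K - 1}: {max_gap:.2e}")
```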


What if we would like to count the number of successes not within a discrete set 
of trials but within a bounded uncountably infinite set, such as an interval? For
example, consider the total number of calls in a call centre within a 10-minute interval. In
principle, the phone could ring at any instant of time, but there are uncountably
many instants (= trials) within the 10-minute interval! It turns out that such a
distribution exists, provided that the probability of a success for any given instant is 
“very small”, and it is called the Poisson distribution. 


Definition 1.9 (Poisson Distribution) 
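A random variable $X$ is said to follow the Poisson distribution with parameter
$\lambda > 0$, denoted $X \sim \mathrm{Poisson}(\lambda)$, if

1. $\mathcal{X} = \{0\} \cup \mathbb{N}$,

2. $f(x; \lambda) = e^{-\lambda} \dfrac{\lambda^x}{x!}$.

The mean, variance and moment generating function of $X \sim \mathrm{Poisson}(\lambda)$ are
given by

$$E[X] = \lambda, \qquad \mathrm{Var}[X] = \lambda, \qquad M(t) = \exp\{\lambda(e^t - 1)\}.$$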


Exercise 4 Let $X_1, \ldots, X_n \overset{iid}{\sim} \mathrm{Poisson}(\lambda)$. Show that $Y = \sum_{i=1}^n X_i \sim \mathrm{Poisson}(n\lambda)$.


Exercise 5 Let $X \sim \mathrm{Poisson}(\lambda)$ and $Y \sim \mathrm{Poisson}(\mu)$ be independent. Show that the
conditional distribution of $X$ given $X + Y = k$ is $\mathrm{Binom}(k, \lambda/(\lambda + \mu))$.
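
The statement of Exercise 5 can be verified numerically for particular parameter values with the Python sketch below. It computes the conditional probability directly from the definition, using independence for the joint probability and summing over all ways of obtaining $X + Y = k$ for the denominator. The values of `lam`, `mu` and `k` are illustrative assumptions.

```python
from math import comb, exp, factorial

def poisson_pmf(x, lam):
    """f(x; lambda) = exp(-lambda) lambda^x / x!."""
    return exp(-lam) * lam**x / factorial(x)

def conditional_pmf(x, k, lam, mu):
    """P[X = x | X + Y = k] for independent X ~ Poisson(lam), Y ~ Poisson(mu)."""
    joint = poisson_pmf(x, lam) * poisson_pmf(k - x, mu)
    total = sum(poisson_pmf(j, lam) * poisson_pmf(k - j, mu) for j in range(k + 1))
    return joint / total

lam, mu, k = 2.0, 3.5, 7   # illustrative values
q = lam / (lam + mu)

max_gap = max(
    abs(conditional_pmf(x, k, lam, mu) - comb(k, x) * q**x * (1.0 - q) ** (k - x))
    for x in range(k + 1)
)
print(f"largest discrepancy from Binom(k, lam/(lam + mu)): {max_gap:.2e}")
```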


It would seem that the Poisson distribution came out of nowhere, whereas 
the other distributions we considered were linked with the Bernoulli distribution.
It turns out that there is an important connection between the Poisson and
Binomial distributions. Roughly speaking, a Poisson distribution is the limit of
a Binomial distribution when $n \to \infty$ and $p = \lambda/n$ (the number of trials
diverges to infinity but the probability of success decreases to zero in inverse proportion
to the number of trials). This link also helps us make precise mathematical
sense of the way we motivated the Poisson distribution. It is the Law
of Rare Events, and will be stated rigorously in Exercise 24 (p. 54) (Figs. 1.3 
and 1.4). 
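
The convergence described above can be observed numerically. The Python sketch below compares the $\mathrm{Binom}(n, \lambda/n)$ and $\mathrm{Poisson}(\lambda)$ probability mass functions for increasing $n$. It only illustrates the limit and is not a substitute for the rigorous statement. The value of `lam`, the grid of `n` values and the cut-off `x_max` are illustrative assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """f(x; n, p) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)

def poisson_pmf(x, lam):
    """f(x; lambda) = exp(-lambda) lambda^x / x!."""
    return exp(-lam) * lam**x / factorial(x)

def max_pmf_gap(n, lam, x_max=20):
    """Largest |Binom(n, lam/n) PMF - Poisson(lam) PMF| over x = 0, ..., min(n, x_max)."""
    p = lam / n
    return max(
        abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam))
        for x in range(min(n, x_max) + 1)
    )

lam = 3.0  # illustrative value
gaps = {n: max_pmf_gap(n, lam) for n in (10, 100, 1000, 10000)}

for n, gap in gaps.items():
    print(f"n = {n:>5}, p = lam/n = {lam / n:.4f}: largest PMF gap = {gap:.2e}")
```

The gap shrinks as $n$ grows, in line with the heuristic that a very large number of trials, each with a very small success probability, yields approximately Poisson counts.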


Example 1.10 


We list here some random experiments for which the Poisson distribution is a reasonable 
probability model. All of these involve modelling counts over a finite time horizon, when there 
is no a priori upper bound on the total. 

1. The number of visits to a website during a given day can be well modelled by a Poisson 
distribution. The parameter of the Poisson distribution will be interpreted as the mean number 
of visits on that day. 

2. The yearly number of earthquakes in a given bounded spatial region is typically Poisson 
distributed, with parameter equal to the mean number of earthquakes per year in that 
region. 

Fig. 1.3 Negative binomial probability mass functions for different values of r and p 


3. Radioactive materials have unstable atoms, which emit particles (such as alpha particles and 
gamma rays). Quantum theory postulates that, at the level of each atom, the number of 
particles emitted within a given fixed time interval is random. The typical model for this 
random variable is a Poisson distribution with mean given by the decay constant of the 
material. 

4. In positron emission tomography, we attempt to image the interior of the human body in order 
to detect features of interest, for example cancers. A tracer is injected into the human body 
that emits positrons. This tracer is spread throughout the human body, but concentrates more 
in tissue with high metabolic activity (e.g. a cancerous tissue). By counting the number of 
positrons emitted at a given physical location, we have an indication of the metabolic activity 
in that location. The number of particles emitted at a given location typically behaves like 
a Poisson distribution with mean parameter given by the concentration of the tracer at that 
physical location. In other words, the intensity of the tomography image obtained at any 
pixel is Poisson distributed with mean given by the true concentration of the material at that 
pixel. 


□ 

Fig. 1.4 Poisson probability mass functions for different values of the parameter $\lambda$ (legend: $\lambda = 2, 4, 8, 10$) 


1.2 Continuous Regular Models 


We now switch to the continuous case, and consider some of the key probability 
models for random variables taking values in R. To define these, it suffices to 
determine their probability density function. We first consider one of the simplest 
continuous probability models: a random variable that is “equally likely” to take 
values anywhere on a bounded interval. 


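Definition 1.11 (Uniform Distribution) 
A random variable $X$ is said to follow the uniform distribution with parameters 
$-\infty < \theta_1 < \theta_2 < \infty$, denoted $X \sim \text{Unif}(\theta_1, \theta_2)$, if 

\[ f_X(x; \theta) = \begin{cases} (\theta_2 - \theta_1)^{-1} & \text{if } x \in (\theta_1, \theta_2), \\ 0 & \text{otherwise.} \end{cases} \]

The mean, variance and moment generating function of $X \sim \text{Unif}(\theta_1, \theta_2)$ are 
given by 

\[ E[X] = (\theta_1 + \theta_2)/2, \qquad \text{Var}[X] = (\theta_2 - \theta_1)^2/12, \qquad M(t) = \frac{e^{t\theta_2} - e^{t\theta_1}}{t(\theta_2 - \theta_1)}, \quad t \neq 0, \quad M(0) = 1. \]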


In a discrete setting, the uniform distribution gives equal probability to any 
possible simple outcome from within the finite sample space of outcomes. In 
the continuous case, the probability of observing a specific number in $(\theta_1, \theta_2)$ is 
precisely zero, but uniformity is understood in the sense that the probability of 
observing an outcome falling in a given subinterval of $(\theta_1, \theta_2)$ is proportional to 
the length of that interval. 


Example 1.12 


The uniform distribution is as spread out as possible over a finite interval. In that sense, it can be 
used to model situations where we have “complete ignorance”, where we are not prepared to make 
any assumptions, or where the phenomenon under study is highly unpredictable. 

1. Suppose that our bus is supposed to pass every 10 minutes, and we arrive at a random moment at 
the bus stop, without knowing the schedule. It is natural to model our waiting time by a uniform 
distribution on $(0, 10)$. 

2. Suppose that our compass is broken, and the needle moves freely. Then, if we move in the 
direction that the compass indicates for “north” at some random moment, the true direction we 
will move in can be modelled as a random variable with the uniform distribution on $(0, 2\pi)$ 
(where we can imagine $\pi/2$ to correspond to the true “north”). 

3. Consider the movement of excited gas molecules (in high temperature) in a container shaped 
as a cube of edge length 1. If we let the molecules move freely inside the container, and then 
ask for the location of a specific molecule after some time $t$ (where $t$ is large), the coordinates 
of this location $(X, Y, Z)$ can be modelled very accurately by iid uniform random variables on 
$(0, 1)$, regardless of the starting point of the molecule. 

□ 


Our next model is typically appropriate when we wish to model the time elapsed 
until the occurrence of a certain event, or between events, when this time is random. 


Definition 1.13 (Exponential Distribution) 
A random variable $X$ is said to follow the exponential distribution with parameter 
$\lambda > 0$, denoted $X \sim \text{Exp}(\lambda)$, if 

\[ f_X(x; \lambda) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \ge 0, \\ 0, & \text{if } x < 0. \end{cases} \]

The mean, variance and moment generating function of $X \sim \text{Exp}(\lambda)$ are given 
by 

\[ E[X] = \lambda^{-1}, \qquad \text{Var}[X] = \lambda^{-2}, \qquad M(t) = \frac{\lambda}{\lambda - t}, \quad t < \lambda. \]


Note the interpretation here: $\lambda^{-1}$ is the average time until the occurrence of the 
event of interest (measured in some given unit of time). So $\lambda$ is interpreted as a 
rate parameter. The exponential distribution can be considered to be the continuous 
version of the geometric distribution, when the number of trials becomes large, and 
the probability of success becomes small. 

A crucial property of the exponential distribution is that it is “memoryless”: 
no matter how long you’ve been waiting already, the probability of waiting for an 
additional amount of time x only depends on x, and not your past waiting time: 


Exercise 6 Let $X \sim \text{Exp}(\lambda)$. Then $P[X \ge x + t \mid X \ge t] = P[X \ge x]$. 


The exponential distribution is, in fact, the unique distribution on $[0, \infty)$ with 
this property (see Exercise 14, p. 27). Therefore, when choosing the exponential 
distribution as a model for a random time, we must always ask if it is reasonable 
to assume that this random time has the lack of memory property (Figs. 1.5 
and 1.6). 
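To get a feeling for what lack of memory means, the following minimal sketch compares the conditional probability $P[X \ge x + t \mid X \ge t]$ with $P[X \ge x]$ for an exponential waiting time and for a uniform waiting time on $(0, 10)$ (as in the bus example). The function names and the numerical values are illustrative assumptions.

```python
import math

def exp_survival(x: float, lam: float) -> float:
    """P[X >= x] for X ~ Exp(lam)."""
    return math.exp(-lam * x) if x > 0 else 1.0

def unif_survival(x: float, a: float, b: float) -> float:
    """P[X >= x] for X ~ Unif(a, b)."""
    return min(1.0, max(0.0, (b - x) / (b - a)))

def conditional_survival(survival, x: float, t: float) -> float:
    """P[X >= x + t | X >= t], computed as a ratio of survival probabilities."""
    return survival(x + t) / survival(t)

# Exponential: the conditional probability does not depend on the elapsed time t.
for t in (0.0, 1.0, 7.5):
    print("Exp ", t, conditional_survival(lambda s: exp_survival(s, 0.5), 2.0, t))

# Uniform on (0, 10): the conditional probability changes with t.
for t in (0.0, 5.0, 7.5):
    print("Unif", t, conditional_survival(lambda s: unif_survival(s, 0.0, 10.0), 2.0, t))
```

For the exponential, every row shows the same value; for the uniform, having already waited makes a further wait of the same length less likely.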


Example 1.14 


The exponential distribution has important connections to the Poisson distribution. Roughly 
speaking, if the time between consecutive occurrences of a certain phenomenon is independent 
exponential, then the number of occurrences of the phenomenon up to a given time will be Poisson. 

For example: 

1. The time between two consecutive occurrences of an earthquake at a given spatial region can 
be modelled as an exponential random variable. 

2. The time between consecutive emissions of alpha particles from an atom of a radioactive 
material is very well modelled by an exponential distribution. The rate of this exponential 
distribution will be intimately related to the decay constant of the material. 

3. The amount of time between two consecutive visits at a website can also be modelled by an 
exponential distribution. 

□ 

Fig. 1.5 Uniform probability density function for general values of $(\theta_1, \theta_2)$ 

Fig. 1.6 Exponential probability density functions for different values of the parameter $\lambda$ 


Exercise 7 Let $X, Y$ be independent exponential random variables with rates $\lambda_1$ 
and $\lambda_2$. Prove that $Z = \min\{X, Y\}$ is also exponential with rate $\lambda_1 + \lambda_2$. 


Now suppose that we are interested in the time until the $r$th event, in a situation 
where the times between events are distributed as iid $\text{Exp}(\lambda)$. This resembles the 
discrete situation where we are waiting until the $r$th success in a sequence of 
Bernoulli trials, which takes us from the geometric to the negative binomial (the 
negative binomial distribution being the sum of $r$ iid geometric random variables). 
It turns out that the sum of $r$ iid exponential random variables has a gamma 
distribution: 


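Definition 1.15 (Gamma Distribution) 

A random variable $X$ is said to follow the gamma distribution with parameters 
$r > 0$ and $\lambda > 0$ (the shape and rate parameters, respectively), denoted $X \sim 
\text{Gamma}(r, \lambda)$, if 

\[ f_X(x; r, \lambda) = \begin{cases} \dfrac{\lambda^r}{\Gamma(r)} x^{r-1} e^{-\lambda x}, & \text{if } x \ge 0, \\ 0, & \text{if } x < 0. \end{cases} \]

The mean, variance and moment generating function of $X \sim \text{Gamma}(r, \lambda)$ are 
given by 

\[ E[X] = r/\lambda, \qquad \text{Var}[X] = r/\lambda^2, \qquad M(t) = \left(\frac{\lambda}{\lambda - t}\right)^{r}, \quad t < \lambda. \]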


Note that the way we have defined the gamma distribution does not restrict r 
to be a natural number. It is indeed true that a gamma distribution can be defined 
more generally for r > 0. The interpretation as a sum of r exponentials of rate A 
will only be valid when r happens to be a positive integer. The gamma distribution 
can provide a flexible model for a wide variety of phenomena that give rise to non- 
negative random variables. The suitability of these models is not always completely 
founded on concrete physical principles. It is sometimes dictated by convenience, 
and other times by extensive practical experience. 
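The interpretation of $\text{Gamma}(r, \lambda)$ as a sum of $r$ iid $\text{Exp}(\lambda)$ variables (for integer $r$) can be checked informally by simulation. The sketch below is only an illustration under assumed values $r = 3$, $\lambda = 2$: it compares the empirical mean and variance of simulated sums with the theoretical values $r/\lambda$ and $r/\lambda^2$. The function name and the seed are arbitrary choices.

```python
import random

def simulate_gamma_moments(r: int, lam: float, n_samples: int = 50_000, seed: int = 0):
    """Empirical mean and variance of sums of r iid Exp(lam) draws."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    # random.expovariate takes the rate parameter, matching Exp(lam) here.
    draws = [sum(rng.expovariate(lam) for _ in range(r)) for _ in range(n_samples)]
    mean = sum(draws) / n_samples
    var = sum((d - mean) ** 2 for d in draws) / n_samples
    return mean, var

mean, var = simulate_gamma_moments(r=3, lam=2.0)
print("empirical mean:", mean, " theoretical:", 3 / 2.0)
print("empirical var: ", var, " theoretical:", 3 / 2.0**2)
```

Agreement of the first two moments is of course not a proof; the rigorous statement follows, for instance, by comparing moment generating functions.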

The function $\Gamma(r)$ is the gamma function (from which the distribution inherits 
its name). In the special case when $r$ is a positive integer, $\Gamma(r) = (r - 1)!$. There 
is a particular special case of the Gamma distribution, known as the chi-squared 
distribution, that is especially important in statistical theory and practice: 


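Definition 1.16 (Chi-Square Distribution) 

A random variable $X$ is said to follow the chi-square distribution with parameter 
$k \in \mathbb{N}$ (called the number of degrees of freedom), denoted $X \sim \chi^2_k$, if it holds 
that $X \sim \text{Gamma}(k/2, 1/2)$. In other words, 

\[ f_X(x; k) = \begin{cases} \dfrac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}, & \text{if } x \ge 0, \\ 0, & \text{if } x < 0. \end{cases} \]

The mean, variance and moment generating function of $X \sim \chi^2_k$ are given by 

\[ E[X] = k, \qquad \text{Var}[X] = 2k, \qquad M(t) = (1 - 2t)^{-k/2}, \quad t < \tfrac{1}{2}. \]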




Exercise 8 Show that $X \sim \chi^2_2$ if and only if $X \sim \text{Exp}(1/2)$. 


The continuous probability models we have encountered so far have all been 
restricted either to a bounded interval or to the positive reals. In many phenomena, 
we expect that the random variable can assume any real value, but its distribution 
is centred at (and is symmetric about) a centre location $\mu$. The parameter $\mu$ 
represents the “location” or the value around which we expect typical realisations 
of the random variable to lie. Further to the location, there is typically a “scale” 
parameter, say $\sigma^2$, which expresses how concentrated or diffuse the distribution is 
around the centre. A broad such family of models is the so-called location-scale 
family of models. Among location-scale models, the most important and well-studied, 
and perhaps the most widely applicable, is the normal distribution, also 
referred to as the Gaussian distribution. 


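Definition 1.17 (Normal Distribution) 

A random variable $X$ is said to follow the normal distribution with parameters 
$\mu \in \mathbb{R}$ and $\sigma^2 > 0$ (the mean and variance parameters, respectively), denoted 
$X \sim N(\mu, \sigma^2)$, if 

\[ f_X(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right\}, \qquad x \in \mathbb{R}. \]

The mean, variance and moment generating function of $X \sim N(\mu, \sigma^2)$ are given 
by 

\[ E[X] = \mu, \qquad \text{Var}[X] = \sigma^2, \qquad M(t) = \exp\{t\mu + t^2\sigma^2/2\}. \]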


Remark 1.18 In the special case $Z \sim N(0, 1)$, we use the notation $\varphi(z) = f_Z(z)$ 
and $\Phi(z) = F_Z(z)$, and call these the standard normal density and standard normal 
CDF, respectively. 


Example 1.19 


The normal distribution can be a very good model for a bewildering variety of phenomena. 
Intuitively, almost any phenomenon that can be thought to arise as the result of the addition of 
a large number of random variables with finite variances can be modelled via a normal distribution 
(see the Central Limit Theorem for a precise statement, Theorem 2.23 (p. 56)). In general, 
the normal distribution will be a good model for random variables with finite variance, whose 
distribution is symmetric about a certain value $\mu$, and whose probability of being far from $\mu$ decays 
fast. 

1. Measurement error is most typically modelled as a normal random variable. Suppose that we 
are trying to measure a quantity $\mu$, and our measurement device is imperfect, thus yielding 
measurements $Y$ corrupted by error $\varepsilon$. If the error is additive, then a natural probability model 
is to assert that $Y = \mu + \varepsilon$, and $\varepsilon \sim N(0, \sigma^2)$. Consequently, $Y \sim N(\mu, \sigma^2)$. 

2. It is well established that several random physical phenomena are distributed according to 
the normal distribution. For example, the position after time $t$ of a molecule that moves on 
a line subject to collisions from other molecules has a normal distribution with a mean at its 
starting point and variance equal to $t$. The velocity of any particle in a one-dimensional space 
under thermodynamic equilibrium will be normally distributed. The ground state of a quantum 
harmonic oscillator will also be normally distributed. 

3. The re-scaled difference between a random variable and its mean can very often be 
approximated by a normal distribution. Typically this depends on taking a limiting argument over 
some parameter of that random variable. This includes variables that are discrete. For example, 
we will see later that the approximation is valid in the case of a binomial distribution with a 
large number of trials, or a Poisson distribution with a large rate parameter (in both cases, after 
appropriate centering and scaling). 

4. Experience shows that a wide range of phenomena in the biological sciences, when suitably 
transformed, are remarkably well approximated by the normal distribution. The same is true of 
phenomena in the social sciences, economics and finance. In most of these cases, the underlying 
effect is a central limit theorem effect (Figs. 1.7 and 1.8). 

□ 

Fig. 1.7 Gamma probability density functions for different values of the parameters $r$ and $\lambda$ 

Fig. 1.8 Normal probability density functions for different values of the parameters $\mu$ and $\sigma^2$ 


1.3 Exponential Families of Distributions 


Though it may not be immediately obvious at first sight, many of the models 
we considered earlier—whether discrete or continuous—have some important 
similarities in terms of their structure and their properties. For this reason, we will 
introduce in this paragraph an additional level of abstraction, and consider most of 
the previous models as special cases of a broader family of probability models called 
the exponential family of distributions. The advantage of such an approach is that, 
once we have this more abstract definition, any properties we prove for the general 
case will immediately be inherited by all the special cases. Here is the definition: 


Definition 1.20 (The Exponential Family of Distributions) 
A regular probability distribution is said to be a member of a $k$-parameter 
exponential family, if its density (or frequency) admits the representation 

\[ f(x) = \exp\left\{ \sum_{i=1}^{k} \phi_i T_i(x) - \gamma(\phi_1, \dots, \phi_k) + S(x) \right\}, \qquad x \in \mathcal{X}, \tag{1.1} \]

where: 

1. $\phi = (\phi_1, \dots, \phi_k)$ is a $k$-dimensional parameter in $\mathbb{R}^k$; 

2. $T_i : \mathcal{X} \to \mathbb{R}$, $i = 1, \dots, k$, $S : \mathcal{X} \to \mathbb{R}$, and $\gamma : \mathbb{R}^k \to \mathbb{R}$ are real-valued 
functions; 

3. The sample space $\mathcal{X}$ does not depend on $\phi$. 


Remark 1.21 The parameter $\phi$ is called the natural parameter. 


Remark 1.22 The fact that there is an exponential in the formula (1.1) is in 
itself not the most important structural property of an exponential family (since 
any density function can be written as $f(x) = \exp\{\log f(x)\}$ on its support). The 
important property is that the density can be factorised into three parts: one that only 
depends on $\phi$, i.e. $\exp\{-\gamma(\phi)\}$; one that only depends on $x$, i.e. $\exp\{S(x)\}$; and one 
that depends on both $\phi$ and $x$ but in a very special way: as a linear combination of 
the coordinates of $\phi$ with coefficients that are functions of $x$. 


Remark 1.23 The exponential family of distributions should not be confused 
with the exponential distribution. It is unfortunate that they share such a similar 
name. To avoid confusion, we will always speak of an exponential family to 
distinguish from an exponential distribution. 


We will see that all the distributions that we have so far seen, except for the 
uniform distribution, constitute exponential families. In order to see this, we will 
need to manipulate the expressions of the corresponding densities (or frequencies) 
in order to bring them to the form given by Eq. (1.1). It will 
often happen that the usual parameter employed does not coincide with the natural 
parameter. However, it will typically be the case that $\phi = \eta(\theta)$ for some twice 
differentiable 1-1 mapping $\eta : \Theta \to \mathbb{R}^k$ (and so $\gamma(\phi) = \gamma(\eta(\theta)) = d(\theta)$, for 
$d = \gamma \circ \eta$). In this form, the exponential family density/frequency will take the 
form: 

\[ \exp\left\{ \sum_{i=1}^{k} \phi_i T_i(x) - \gamma(\phi) + S(x) \right\} = \exp\left\{ \sum_{i=1}^{k} \eta_i(\theta) T_i(x) - d(\theta) + S(x) \right\}. \]

Either formulation can be used, depending on which is most convenient in a 
specific context: for the purpose of doing theory and proving general results, 
the natural representation (also called natural parametrisation) given by 
$\exp\left\{ \sum_{i=1}^{k} \phi_i T_i(x) - \gamma(\phi) + S(x) \right\}$ is more convenient.¹ In most practical settings, 
problems are presented in such a way that the parameter of interest is the $\theta$ 
parameter from the usual representation (also called usual parametrisation) given 
by $\exp\left\{ \sum_{i=1}^{k} \eta_i(\theta) T_i(x) - d(\theta) + S(x) \right\}$. Generally, thus, the strategy is to prove 
any necessary theorems in the natural representation, and then translate them into 
results for the usual representation. 

¹ The reason for this is that in the natural representation, the parameter appears linearly in the 
exponent. In the usual representation, the parameter appears nonlinearly, as the image through 
the function $\eta$. This complicates things when we will need to differentiate with respect to the 
parameter. 


Example 1.24 (Binomial Exponential Family) 

Let $X \sim \text{Binom}(n, p)$. Recall that this means that $\mathcal{X} = \{0, 1, 2, \dots, n\}$ and 
$f(x; p) = \binom{n}{x} p^x (1 - p)^{n - x}$. Now, we may take the log and then exponentiate to obtain: 

\[ \binom{n}{x} p^x (1 - p)^{n - x} = \exp\left\{ \log\left( \frac{p}{1 - p} \right) x + n \log(1 - p) + \log\binom{n}{x} \right\}. \]

Define: 

\[ \phi = \log\left( \frac{p}{1 - p} \right), \quad T(x) = x, \quad S(x) = \log\binom{n}{x}, \quad \gamma(\phi) = n \log(1 + e^{\phi}) = -n \log(1 - p). \]

Thus, if $n$ is held fixed and only $p$ is allowed to vary, the support of $f$ does not depend on $\phi$ 
and so we see that the Binomial with fixed $n$ is a 1-parameter exponential family. Here the usual 
parameter $p$ is a twice differentiable bijection of the natural parameter $\phi$: 

\[ p = \frac{e^{\phi}}{1 + e^{\phi}} \iff \phi = \eta(p) = \log\left( \frac{p}{1 - p} \right). \]

Here $p \in (0, 1)$ but $\phi \in \mathbb{R}$. □ 
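As a sanity check on the algebra of the binomial example, the following minimal sketch evaluates the binomial frequency both in its usual form and in the exponential family form with $T(x) = x$, $S(x) = \log\binom{n}{x}$ and $\gamma(\phi) = n\log(1 + e^{\phi})$. The function names and the values $n = 7$, $p = 0.3$ are illustrative assumptions.

```python
import math

def binom_pmf(x: int, n: int, p: float) -> float:
    """Usual representation: C(n, x) p^x (1 - p)^(n - x)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def natural_param(p: float) -> float:
    """phi = eta(p) = log(p / (1 - p))."""
    return math.log(p / (1 - p))

def usual_param(phi: float) -> float:
    """Inverse map: p = e^phi / (1 + e^phi)."""
    return math.exp(phi) / (1 + math.exp(phi))

def binom_expfam(x: int, n: int, phi: float) -> float:
    """Natural representation: exp{phi * T(x) - gamma(phi) + S(x)}."""
    T = x
    S = math.log(math.comb(n, x))
    gamma = n * math.log1p(math.exp(phi))
    return math.exp(phi * T - gamma + S)

n, p = 7, 0.3
phi = natural_param(p)
for x in range(n + 1):
    print(x, binom_pmf(x, n, p), binom_expfam(x, n, phi))
```

The two columns should agree up to floating-point rounding, and `usual_param(natural_param(p))` should recover `p`.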


Example 1.25 (Counterexample: Uniform Distribution) 

Let $X \sim \text{Unif}(\theta_1, \theta_2)$. Notice that $f(x; \theta_1, \theta_2)$ is positive if and only if $x \in [\theta_1, \theta_2]$. Therefore 
the support of $f$ depends on the parameter, and thus the uniform distribution is not an exponential 
family. Notice, though, that if we fix $\theta_1$ and $\theta_2$ and consider the specific fixed density (rather than 
a whole family as $\theta_1$ and $\theta_2$ vary), then we do have an exponential family form, albeit a degenerate 
one with a single member. □ 


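Example 1.26 (Gaussian Exponential Family) 

Let $X \sim N(\mu, \sigma^2)$. Then we may write: 

\[ f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right\} = \exp\left\{ \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{1}{2} \log(2\pi\sigma^2) - \frac{\mu^2}{2\sigma^2} \right\}. \]

Define: 

\[ \phi_1 = \frac{\mu}{\sigma^2}, \quad \phi_2 = -\frac{1}{2\sigma^2}, \quad T_1(x) = x, \quad T_2(x) = x^2, \quad S(x) = 0, \quad \gamma(\phi_1, \phi_2) = -\frac{\phi_1^2}{4\phi_2} + \frac{1}{2} \log\left( -\frac{\pi}{\phi_2} \right), \]

and also observe that the support of $f$ is always $\mathbb{R}$, regardless of the parameter values. It follows 
that the $N(\mu, \sigma^2)$ distribution is a 2-parameter exponential family. □ 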


Exercise 9 (More Exponential Families) Show that the following distributions 
constitute exponential families (perhaps when one of their parameters is held 
fixed): 

1. The Poisson distribution. 

2. The geometric distribution. 

3. The negative binomial distribution. 

4. The exponential distribution. 

5. The gamma distribution. 

6. The chi-square distribution. 


There are several more probability models that form exponential families. 
Though we have not studied them here explicitly, it is worth mentioning them: 
the Pareto distribution, the Weibull distribution, the Laplace distribution, the chi- 
squared distribution, the lognormal distribution, the inverse Gaussian distribution, 
the inverse gamma distribution, the normal-gamma distribution and the beta distri- 
bution, among others. 

Later, we will prove some key theorems on estimation and hypothesis testing for 
exponential families; and these results will then be valid for any specific exponential 
family. 


1.4 Transforming Probability Models 


It is often the case that we have a model for a particular random phenomenon 
whose outcome is described by a random variable X, but we are really interested in 
modelling some aspect of this phenomenon, say g(X ), where g is a known function. 


Example 1.27 


Suppose that $R$ is a positive random variable denoting the radius of coverage of a wireless antenna. 
Assume that $R \sim \text{Unif}[a, b]$, for some $0 < a < b$. What is the distribution of the area of coverage, 
$A = \pi R^2$? □ 


The purpose of this section is to investigate what the distribution of g(X) is, 
given knowledge of the distribution of X; in other words, how the distribution of 
a random variable X is transformed, when the random variable X is transformed. 
In the discrete case, things are relatively straightforward (though they rarely give 
simple closed form expressions for the resulting distributions). 


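Lemma 1.28 Let $X$ be a discrete random variable, and $Y = g(X)$. Then, the 
sample space of $Y$ is $\mathcal{Y} = g(\mathcal{X})$ and 

\[ F_Y(y) = P[g(X) \le y] = \sum_{x \in \mathcal{X}} f_X(x)\, \mathbf{1}\{g(x) \le y\}, \qquad \forall y \in \mathcal{Y}, \tag{1.2} \]

\[ f_Y(y) = P[g(X) = y] = \sum_{x \in \mathcal{X}} f_X(x)\, \mathbf{1}\{g(x) = y\}, \qquad \forall y \in \mathcal{Y}. \tag{1.3} \]

Proof It suffices to observe that $P[Y = y] = \sum_{x : g(x) = y} P[X = x]$, $\forall y \in \mathcal{Y}$. 
□ 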
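Formula (1.3) translates directly into a short computation: the frequency of $Y$ at $y$ collects the mass of all $x$ with $g(x) = y$. The sketch below illustrates this under assumed inputs (a uniform frequency on $\{-2, \dots, 2\}$ and $g(x) = x^2$); the function name is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward_pmf(pmf_x: dict, g) -> dict:
    """Frequency of Y = g(X): f_Y(y) is the sum of f_X(x) over all x with g(x) = y."""
    pmf_y = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for x, prob in pmf_x.items():
        pmf_y[g(x)] += prob
    return dict(pmf_y)

# Illustrative example: X uniform on {-2, -1, 0, 1, 2}, Y = X^2.
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}
pmf_y = pushforward_pmf(pmf_x, lambda x: x * x)
for y in sorted(pmf_y):
    print(y, pmf_y[y])
```

Since $g$ is not injective here, the values $y = 1$ and $y = 4$ each receive the mass of two points of $\mathcal{X}$.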


In the case where X is continuous, things are a bit more subtle to state and prove: 
general formulas cannot be obtained for non-bijective g. If g is not a 
bijection, the problem has to be attacked by direct methods that are specific to the 
setup: 


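Example 1.29 (Squared Standard Normal Has $\chi^2_1$ Distribution) 

Let $Z \sim N(0, 1)$. We would like to find the distribution of $Y = Z^2$. Note that $F_Y(y) = P[Y \le 
y] = 0$ if $y < 0$. For $y \ge 0$ we have 

\[ F_Y(y) = P[Z^2 \le y] = P[|Z| \le \sqrt{y}] = P[-\sqrt{y} \le Z \le \sqrt{y}] = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}) = \Phi(\sqrt{y}) - (1 - \Phi(\sqrt{y})) = 2\Phi(\sqrt{y}) - 1. \]

We can also find the density by differentiating (for $y > 0$): 

\[ f_Y(y) = \frac{\partial}{\partial y}\left( 2\Phi(\sqrt{y}) - 1 \right) = 2\varphi(\sqrt{y}) \cdot \frac{1}{2} y^{-1/2} = \varphi(\sqrt{y})\, y^{-1/2} = \frac{1}{\sqrt{2\pi}} e^{-y/2} y^{-1/2} = \frac{1}{2^{1/2}\,\Gamma(1/2)}\, y^{1/2 - 1} e^{-y/2}, \]

where the last step uses $\Gamma(1/2) = \sqrt{\pi}$. 
Notice that the last expression is the density of the $\chi^2_1$ distribution (see Definition 1.16, p. 13). We 
therefore have 

\[ Z \sim N(0, 1) \implies Z^2 \sim \chi^2_1. \tag{1.4} \]

□ 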
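The computation in the example can be checked numerically. The following minimal sketch differentiates the distribution function $2\Phi(\sqrt{y}) - 1$ by a central finite difference and compares the result with the $\chi^2_1$ density of Definition 1.16. The function names, the step size and the evaluation points are illustrative assumptions.

```python
import math

def std_normal_cdf(z: float) -> float:
    """Phi(z), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cdf_of_square(y: float) -> float:
    """F_Y(y) = 2 Phi(sqrt(y)) - 1 for Y = Z^2, and 0 for y < 0."""
    return 2.0 * std_normal_cdf(math.sqrt(y)) - 1.0 if y > 0 else 0.0

def chi2_1_density(y: float) -> float:
    """Chi-square density with k = 1 degree of freedom, for y > 0."""
    return y ** (-0.5) * math.exp(-y / 2.0) / (math.sqrt(2.0) * math.gamma(0.5))

def numerical_density(y: float, h: float = 1e-5) -> float:
    """Central finite-difference approximation to the derivative of F_Y at y."""
    return (cdf_of_square(y + h) - cdf_of_square(y - h)) / (2.0 * h)

for y in (0.5, 1.0, 2.0, 4.0):
    print(y, numerical_density(y), chi2_1_density(y))
```

The two columns should agree to several decimal places; points very close to $y = 0$ are avoided because the density is unbounded there.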


On the other hand, if $g$ is a monotone differentiable transformation, then we may 
derive general explicit (closed form) expressions for the distribution and density of 
$g(X)$. 


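Lemma 1.30 Let $X$ be a continuous random variable on $\mathcal{X} \subseteq \mathbb{R}$ and let $g : 
\mathcal{X} \to \mathbb{R}$ be monotone and differentiable, with derivative nonzero on $\mathcal{X}$. Let $Y = 
g(X)$. Then, the sample space of $Y$ is $\mathcal{Y} = g(\mathcal{X})$ and 

* if $g$ is increasing, then $F_Y(y) = F_X(g^{-1}(y))$, $\forall y \in \mathcal{Y}$, 

* if $g$ is decreasing, then $F_Y(y) = 1 - F_X(g^{-1}(y))$, $\forall y \in \mathcal{Y}$. 

In either case, we will have 

\[ f_Y(y) = \left| \frac{\partial}{\partial y} g^{-1}(y) \right| f_X(g^{-1}(y)), \qquad \forall y \in \mathcal{Y}. \]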


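Proof Assume initially that $g'$ is positive everywhere on $\mathcal{X}$ ($g$ is monotone 
increasing). This means that $x \le y \iff g(x) \le g(y)$. Then, for $y \in \mathcal{Y}$, 

\[ F_Y(y) = P[g(X) \le y] = P[X \le g^{-1}(y)] = F_X(g^{-1}(y)). \]

Therefore, 

\[ f_Y(y) = \frac{\partial}{\partial y} F_Y(y) = \frac{\partial}{\partial y} F_X(g^{-1}(y)) = f_X(g^{-1}(y)) \frac{\partial}{\partial y} g^{-1}(y) = f_X(g^{-1}(y)) \left| \frac{\partial}{\partial y} g^{-1}(y) \right|, \]

with the last equality following from the fact that $g'$ (and hence the derivative of $g^{-1}$) is 
everywhere positive. 
Now consider the case where $g$ is monotone decreasing (and so $g'$ is negative 
everywhere). This means that $x \le y \iff g(x) \ge g(y)$. Then, for $y \in \mathcal{Y}$, 

\[ 1 - F_Y(y) = P[g(X) > y] = P[X < g^{-1}(y)] = F_X(g^{-1}(y)) - \underbrace{P[X = g^{-1}(y)]}_{=0}. \]

But $f_Y(y) = \frac{\partial}{\partial y} F_Y(y)$. Therefore, 

\[ f_Y(y) = \frac{\partial}{\partial y} \left( 1 - F_X(g^{-1}(y)) \right) = -f_X(g^{-1}(y)) \frac{\partial}{\partial y} g^{-1}(y) = f_X(g^{-1}(y)) \left| \frac{\partial}{\partial y} g^{-1}(y) \right|, \]

since $g'$ (and hence the derivative of $g^{-1}$) is everywhere negative. This completes the proof. □ 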
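Lemma 1.30 can be applied to the antenna question of Example 1.27. With $R \sim \text{Unif}[a, b]$ and $g(r) = \pi r^2$ (increasing on $[a, b]$ since $a > 0$), we have $g^{-1}(y) = \sqrt{y/\pi}$ and the lemma suggests $f_A(y) = \frac{1}{b - a} \cdot \frac{1}{2\sqrt{\pi y}}$ for $y \in [\pi a^2, \pi b^2]$. The minimal sketch below checks numerically that this candidate density integrates to one; the function names, the midpoint rule and the values $a = 1$, $b = 3$ are illustrative assumptions.

```python
import math

def area_density(y: float, a: float, b: float) -> float:
    """Candidate density of A = pi * R^2 for R ~ Unif[a, b], via Lemma 1.30."""
    lo, hi = math.pi * a * a, math.pi * b * b
    if not (lo <= y <= hi):
        return 0.0
    r = math.sqrt(y / math.pi)         # g^{-1}(y)
    dr_dy = 1.0 / (2.0 * math.pi * r)  # derivative of g^{-1} at y
    return dr_dy / (b - a)             # |d g^{-1}/dy| * f_R(g^{-1}(y))

def integrate(f, lo: float, hi: float, n: int = 100_000) -> float:
    """Midpoint rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

a, b = 1.0, 3.0
total = integrate(lambda y: area_density(y, a, b), math.pi * a**2, math.pi * b**2)
half = integrate(lambda y: area_density(y, a, b), math.pi * a**2, math.pi * 2.0**2)
print("total mass:", total)
print("P[A <= 4*pi]:", half, " versus P[R <= 2] =", (2.0 - a) / (b - a))
```

The second line compares $P[A \le 4\pi]$ with $P[R \le 2]$, which must coincide because $g$ is increasing.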
Exercise 10 (Log-Normal Distribution) Let $X \sim N(\mu, \sigma^2)$, and show that the 
density of $Y = e^X$ is given by 

\[ f_Y(y) = \frac{1}{y\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(\ln y - \mu)^2}{2\sigma^2} \right\}, \qquad 0 < y < \infty. \]

The distribution of $Y$ is called the log-normal distribution. 


Exercise 11 (Random Number Generation) Let $Y \sim \text{Unif}(0, 1)$ and let $F$ be a 
distribution function. Prove that the distribution function of the random variable 
$X = F^{-1}(Y)$ is given precisely by $F$, where we define $F^{-1}(y) = \inf\{t \in 
\mathbb{R} : F(t) \ge y\}$ (see Definition A.6, p. 161). Observe that with this result, we 
can generate realisations from any distribution, provided that we can generate 
realisations from the uniform distribution. 
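As an illustration of the last exercise, consider the $\text{Exp}(\lambda)$ distribution, for which $F(t) = 1 - e^{-\lambda t}$ and hence $F^{-1}(u) = -\ln(1 - u)/\lambda$. The minimal sketch below generates exponential realisations from uniform ones and compares empirical summaries with their theoretical values; the function names, seed, sample size and $\lambda = 2$ are illustrative assumptions.

```python
import math
import random

def exp_quantile(u: float, lam: float) -> float:
    """F^{-1}(u) = -ln(1 - u) / lam for the Exp(lam) distribution, 0 <= u < 1."""
    return -math.log1p(-u) / lam

def sample_exponential(n: int, lam: float, seed: int = 0) -> list:
    """Draw n realisations of Exp(lam) by transforming Unif(0, 1) draws."""
    rng = random.Random(seed)  # rng.random() returns values in [0, 1)
    return [exp_quantile(rng.random(), lam) for _ in range(n)]

lam = 2.0
draws = sample_exponential(50_000, lam)
empirical_mean = sum(draws) / len(draws)
empirical_cdf = sum(d <= 0.5 for d in draws) / len(draws)
print("mean:", empirical_mean, " theoretical:", 1 / lam)
print("P[X <= 0.5]:", empirical_cdf, " theoretical:", 1 - math.exp(-lam * 0.5))
```

The same recipe works for any distribution whose quantile function can be evaluated, which is the point of the exercise.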


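An easy corollary to the last two lemmas combined is the following: 


Corollary 1.31 (Affine Transformations) Let $X$ be a random variable and 
$Y = g(X)$. If $g(x) = ax + b$, $a \neq 0$, then 

\[ \forall y \in \mathcal{Y}, \qquad F_Y(y) = \begin{cases} F_X\left( \dfrac{y - b}{a} \right), & a > 0, \\[2ex] 1 - F_X\left( \dfrac{y - b}{a} \right) + P\left( X = \dfrac{y - b}{a} \right), & a < 0, \end{cases} \]

with $P\left( X = \frac{y - b}{a} \right) = 0$ when $X$ is a continuous random variable. Thus, for 
$y \in \mathcal{Y}$, 

1. $f_Y(y) = |a|^{-1} f_X\left( \dfrac{y - b}{a} \right)$, if $X$ is continuous, 

2. $f_Y(y) = f_X\left( \dfrac{y - b}{a} \right)$, if $X$ is discrete. 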


An important special case is that of the behavior of $aX + b$ when $X \sim N(\mu, \sigma^2)$. 


Lemma 1.32 (Affine Transformations of Normal Distributions) Let $X \sim 
N(\mu, \sigma^2)$, $a \neq 0$. Then $aX + b \sim N(a\mu + b, a^2\sigma^2)$. Consequently, if $X \sim 
N(\mu, \sigma^2)$, then 

\[ F_X(x) = \Phi\left( \frac{x - \mu}{\sigma} \right), \]

where $\Phi$ is the standard normal CDF, $\Phi(u) = \int_{-\infty}^{u} (2\pi)^{-1/2} \exp\{-z^2/2\}\, dz$, that 
is, the distribution function of a random variable $Z \sim N(0, 1)$. 

Exercise 12 Prove Lemma 1.32. 


This last result is particularly important because it allows us to calculate 
probabilities associated with normal random variables. The problem is that the integral 
of the normal density over an interval, 
$\int \frac{1}{\sigma\sqrt{2\pi}} \exp\{-(x - \mu)^2/2\sigma^2\}\, dx$, cannot be evaluated in closed form, and so one would 
need to tabulate probabilities for all combinations of $\mu$ and $\sigma$ (an impossible 
task). The last result tells us, however, that we only need to tabulate the standard 
normal CDF, $\Phi$, and calculate probabilities by linear transformation. The process 
of subtracting the mean and then dividing by the standard deviation is called 
standardisation. 
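In computational terms, standardisation means that a single routine for $\Phi$ suffices for every normal distribution. The minimal sketch below evaluates $\Phi$ through the error function of the Python standard library (playing the role of the table) and computes $P[\ell \le X \le u]$ for $X \sim N(\mu, \sigma^2)$; the function names and the numerical values are illustrative assumptions.

```python
import math

def std_normal_cdf(z: float) -> float:
    """Phi(z) for Z ~ N(0, 1), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_prob(lower: float, upper: float, mu: float, sigma: float) -> float:
    """P[lower <= X <= upper] for X ~ N(mu, sigma^2), via standardisation."""
    z_lower = (lower - mu) / sigma  # subtract the mean, divide by the std deviation
    z_upper = (upper - mu) / sigma
    return std_normal_cdf(z_upper) - std_normal_cdf(z_lower)

# Within one standard deviation of the mean: the same probability for any mu, sigma.
print(normal_prob(8.0, 12.0, mu=10.0, sigma=2.0))
print(normal_prob(-1.0, 1.0, mu=0.0, sigma=1.0))
```

Both lines print the same value, since the two events coincide after standardisation.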
As a final result in this section, we state a theorem giving a general formula for 
the joint density of a bijective transformation of a collection of multiple random 
variables. 


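Theorem 1.33 (Multidimensional Transformations) Let $g : \mathbb{R}^n \to \mathbb{R}^n$ be a 
continuously differentiable injection, 

\[ g(x) = (g_1(x), \dots, g_n(x)), \qquad x \in \mathbb{R}^n. \]

Let $X = (X_1, \dots, X_n)^\top$ be a random vector with joint density $f_X(x)$, $x \in \mathcal{X}^n \subseteq \mathbb{R}^n$, 
and define $Y = (Y_1, \dots, Y_n)^\top = g(X)$. Then, if $\mathcal{Y}^n = g(\mathcal{X}^n)$, we have 

\[ f_Y(y) = f_X(g^{-1}(y)) \left| \det\left[ J_{g^{-1}}(y) \right] \right|, \qquad \text{for } y = (y_1, \dots, y_n)^\top \in \mathcal{Y}^n, \]

and zero otherwise, provided that $J_{g^{-1}}(y)$ is well defined. Here, $J_{g^{-1}}(y)$ is the 
Jacobian of $g^{-1}$, i.e. the $n \times n$-matrix-valued function, 

\[ J_{g^{-1}}(y) = \begin{pmatrix} \frac{\partial}{\partial y_1} g_1^{-1}(y) & \cdots & \frac{\partial}{\partial y_n} g_1^{-1}(y) \\ \vdots & \ddots & \vdots \\ \frac{\partial}{\partial y_1} g_n^{-1}(y) & \cdots & \frac{\partial}{\partial y_n} g_n^{-1}(y) \end{pmatrix}. \]


Exercise 13 Use the integration by substitution formula to prove the theorem. 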


Theorem 1.33 can sometimes be used in a clever way, even if the transformation 
involved is not invertible: it suffices to “augment” the transformation, as in the 
corollary that follows: 


Corollary 1.34 (Convolution) Let $X$ and $Y$ be independent continuous random 
variables with densities $f_X$ and $f_Y$. Then, the density of $X + Y$ is given by the 
convolution of $f_X$ with $f_Y$: 

\[ f_{X+Y}(u) = \int_{-\infty}^{+\infty} f_X(u - v) f_Y(v)\, dv. \]

Proof To see this, define 

\[ g : \mathbb{R}^2 \to \mathbb{R}^2, \qquad (x, y) \mapsto (x + y, y), \]

with inverse mapping 

\[ g^{-1} : (u, v) \mapsto (u - v, v). \]

The Jacobian of the inverse can be easily seen to be 

\[ \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}, \]

whose absolute determinant is equal to 1. It follows from the multivariate 
transformation formula that 

\[ f_{X+Y, Y}(u, v) = f_{X,Y}(u - v, v) = f_X(u - v) f_Y(v), \]

where we have used the independence of $X$ and $Y$. Integrating with respect to $v$ 
now yields the marginal density $f_{X+Y}$, 

\[ f_{X+Y}(u) = \int_{-\infty}^{+\infty} f_X(u - v) f_Y(v)\, dv. \]

□ 
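The convolution formula can be illustrated numerically, and connects back to the gamma distribution: for $X, Y$ iid $\text{Exp}(\lambda)$, the sum should have the $\text{Gamma}(2, \lambda)$ density $\lambda^2 u e^{-\lambda u}$. The minimal sketch below approximates the convolution integral by a midpoint rule and compares it with that density. The function names and the value $\lambda = 1.5$ are illustrative assumptions; since both densities vanish on the negative half-line, the integral is restricted to $[0, u]$.

```python
import math

def exp_density(x: float, lam: float) -> float:
    """Density of Exp(lam)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def gamma2_density(u: float, lam: float) -> float:
    """Density of Gamma(2, lam): lam^2 * u * exp(-lam * u) for u >= 0."""
    return lam**2 * u * math.exp(-lam * u) if u >= 0 else 0.0

def convolve_at(f_x, f_y, u: float, lo: float, hi: float, n: int = 2_000) -> float:
    """Midpoint-rule approximation of the integral of f_x(u - v) f_y(v) over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f_x(u - (lo + (i + 0.5) * h)) * f_y(lo + (i + 0.5) * h) for i in range(n))

lam = 1.5
f = lambda x: exp_density(x, lam)
for u in (0.5, 1.0, 3.0):
    # The integrand vanishes outside [0, u], so that range suffices here.
    print(u, convolve_at(f, f, u, 0.0, u), gamma2_density(u, lam))
```

For other pairs of densities the integration range has to be chosen to cover the region where the integrand is non-negligible.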


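We conclude this section with an immediate application of the last corollary, 
concerning sums of normal random variables. 


Corollary 1.35 (Sums of Independent Normal Random Variables) Let 
$X_1, \dots, X_n$ be independent random variables such that $X_i \sim N(\mu_i, \sigma_i^2)$, and 
let $S_n = \sum_{i=1}^{n} X_i$. Then, 

\[ S_n \sim N\left( \sum_{i=1}^{n} \mu_i, \; \sum_{i=1}^{n} \sigma_i^2 \right). \]


Proof It is clear that $E[S_n] = \sum_{i=1}^{n} \mu_i$, so that we may assume that $\mu_i = 0$, and 
show that in this case $S_n \sim N(0, \sigma_1^2 + \cdots + \sigma_n^2)$. We proceed by induction, starting 
with $n = 2$. For tidiness, write $\sigma^2 = \sigma_1^2 + \sigma_2^2$. Then, by Corollary 1.34, we have 

\[ f_{X_1+X_2}(u) = \int_{-\infty}^{+\infty} f_{X_1}(u - v) f_{X_2}(v)\, dv = \int_{-\infty}^{+\infty} \frac{1}{\sigma_1 \sigma_2 2\pi} \exp\left\{ -\frac{\sigma_2^2 u^2 + \sigma_2^2 v^2 - 2\sigma_2^2 uv + \sigma_1^2 v^2}{2\sigma_1^2 \sigma_2^2} \right\} dv. \]

Completing the square, we have 

\[ \sigma_2^2 u^2 + \sigma_2^2 v^2 - 2\sigma_2^2 uv + \sigma_1^2 v^2 = \sigma_2^2 u^2 + \sigma^2 v^2 - 2\sigma_2^2 uv + \sigma_2^4 \sigma^{-2} u^2 - \sigma_2^4 \sigma^{-2} u^2 = (\sigma_2^2 - \sigma_2^4 \sigma^{-2})\, u^2 + (\sigma v - \sigma_2^2 \sigma^{-1} u)^2. \]

Since $\sigma_2^2 - \sigma_2^4 \sigma^{-2} = \sigma_1^2 \sigma_2^2 / \sigma^2$, it follows that 

\[ \frac{\sigma_2^2 u^2 + \sigma_2^2 v^2 - 2\sigma_2^2 uv + \sigma_1^2 v^2}{2\sigma_1^2 \sigma_2^2} = \frac{u^2}{2\sigma^2} + \frac{(\sigma v - \sigma_2^2 \sigma^{-1} u)^2}{2\sigma_1^2 \sigma_2^2}. \]

Hence, with the change of variables $w = \sigma v$, we have 

\[ f_{X_1+X_2}(u) = \int_{-\infty}^{+\infty} \frac{1}{\sigma_1 \sigma_2 2\pi} \exp\left\{ -\frac{u^2}{2\sigma^2} \right\} \exp\left\{ -\frac{(\sigma v - \sigma_2^2 \sigma^{-1} u)^2}{2\sigma_1^2 \sigma_2^2} \right\} dv \]

\[ = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{u^2}{2\sigma^2} \right\} \int_{-\infty}^{+\infty} \frac{1}{\sigma_1 \sigma_2 \sqrt{2\pi}} \exp\left\{ -\frac{(w - \sigma_2^2 \sigma^{-1} u)^2}{2\sigma_1^2 \sigma_2^2} \right\} dw = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{u^2}{2\sigma^2} \right\}, \]

since the integrand is the density of a Gaussian distribution with mean $\sigma_2^2 \sigma^{-1} u$ and 
variance $\sigma_1^2 \sigma_2^2$. In summary, we have 

\[ f_{X_1+X_2}(u) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{u^2}{2\sigma^2} \right\}, \]

which is the density of a $N(0, \sigma^2)$ distribution. 
For the induction step, suppose that we have proven that $S_k \sim N(0, \sigma_1^2 + \cdots + \sigma_k^2)$, 
and wish to prove that $S_{k+1} \sim N(0, \sigma_1^2 + \cdots + \sigma_{k+1}^2)$. Since 

\[ S_{k+1} = S_k + X_{k+1} \]

is the sum of a $N(0, \sigma_1^2 + \cdots + \sigma_k^2)$ with an independent $N(0, \sigma_{k+1}^2)$ random variable, 
the first part of the proof shows that indeed $S_{k+1} \sim N(0, \sigma_1^2 + \cdots + \sigma_{k+1}^2)$, and the 
proof is complete. □ 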
In the sequel, we will typically assume that a specific type of probability model 
has already been selected as a description of a random phenomenon, and will 
proceed in developing our theory taking this model as given. But before we do 
so, we must at least pause for a short moment and consider how or why such a 
model was selected in the first place. In other words, why does it make sense to 
assume that the exponential distribution is a good model for the waiting time until 
the emission of a radioactive particle, or the Poisson distribution in order to model 
the number of bacteria in a water tank? In very broad terms, we can say that the 
selection of a probability model could be based upon: (1) scientific theory and prior 
experimentation; (2) philosophical principles; (3) exploratory data analysis; (4) a 
combination of (1), (2) and (3). 

The ideal situation is one where the modeler may choose a probability model 
as a consequence of a well-founded scientific theory or overwhelming empirical 
evidence. This is often the case in random phenomena that occur in the physical 
sciences, most commonly in physics, as a result of physical laws and/or experi- 
ments. These laws may suggest that the random phenomenon must satisfy certain 
conditions and/or possess some properties. If we are fortunate enough, we may 
have enough properties and conditions to uniquely determine a suitable probability 
model. There is much knowledge on whether or not a certain list of properties 
uniquely specifies a certain probability model in the field of characterisation of 
probability models. 


Example 1.36 (Exponential Distribution for Emission Time) 


Scientific theory suggests that it is impossible to predict how long it will take until an unstable 
nucleus decays. This time is a random variable $T$. In fact, the random process is such that even if
a certain amount of time has elapsed and we have not yet seen a decay, this does not give us any
information at all on how much longer we might still have to wait. In mathematical terms:

\[ P[T > t + s \mid T > t] = P[T > s]. \]

We know that the exponential distribution $f(t) = \lambda e^{-\lambda t}\mathbf{1}\{t \ge 0\}$ has this property. In fact, it can
be proven that this is the only distribution supported on $[0, \infty)$ that has this property, thus dictating
its choice in order to model radioactive particle emission times. □
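The lack of memory property can be checked numerically from the survival function of the exponential distribution, $P[T > t] = e^{-\lambda t}$. The following is only a minimal sketch: the rate `lam = 0.7` and the pairs $(t, s)$ are hypothetical values chosen for illustration, and the function names are ours.

```python
import math

def survival_exp(t, lam):
    """P[T > t] for T ~ Exp(lam): exp(-lam * t) for t >= 0, and 1 for t < 0."""
    return math.exp(-lam * t) if t >= 0 else 1.0

def conditional_survival(t, s, lam):
    """P[T > t + s | T > t], computed from the definition of conditional probability."""
    return survival_exp(t + s, lam) / survival_exp(t, lam)

lam = 0.7  # hypothetical rate, chosen only for illustration
for t, s in [(0.5, 1.0), (2.0, 1.0), (10.0, 0.25)]:
    lhs = conditional_survival(t, s, lam)
    rhs = survival_exp(s, lam)
    print(f"t={t}, s={s}: P[T>t+s | T>t] = {lhs:.6f}, P[T>s] = {rhs:.6f}")
```

Whatever the elapsed time $t$, the two printed probabilities agree up to rounding, which is exactly what the displayed identity asserts.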


Exercise 14 Prove that the lack of memory property characterises the exponential.
More precisely, let $X$ be a random variable such that $P(X > 0) > 0$ and

\[ P(X > t + s \mid X > t) = P(X > s), \quad \forall\, t, s \ge 0. \]

Prove that there exists a $\lambda > 0$ such that $X \sim \mathrm{Exp}(\lambda)$.

Hint: Let $G(t) = P(X > t)$. Show that the lack of memory property implies that
$G(t + s) = G(t)G(s)$ for $t, s \ge 0$. Then, define $g(t) = -\ln G(t)$ and $\lambda = g(1)$.
Show that $g(t) = t\lambda$ for all $t \ge 0$ rational. Deduce that $g(t) = t\lambda$ for all $t \ge 0$.
What is the sign of $\lambda$? Finally, show that $\lambda < \infty$ using the fact that $G(0) > 0$ and
continuity from the right of $G$.


It may seem that perfect characterisation of probability models is likely to occur 
only in relatively simple phenomena. This is not necessarily the case. Very often, we 
can build more and more complex models by combining several different constraints 
(stemming from theory or experiment), partial characterisations, approximations 
and mathematical manipulation. We will not consider here more elaborate examples, 
but will mention that Einstein’s model for the movement of a particle in a gas or a 
liquid (the famous Brownian motion) can be developed by such means. 

Other times, even if we impose all the necessary conditions, we cannot uniquely 
determine a probability model. In other words, there are several candidate proba- 
bility models that would respect the conditions imposed by scientific theory and 
experiment. If we have no other source of information or no other evidence to help 
us choose a model, then we might have to choose one by means of some sort of 
principle or postulate, for example a philosophical/epistemological principle. 


Example 1.37 (Entropy) 


Suppose that we wish to model a natural phenomenon whose outcome is described by a continuous 
random variable $X$ taking values on a given $\mathcal{X} \subseteq \mathbb{R}$. Assume that scientific theory dictates that
the phenomenon should satisfy certain properties on average, in the sense that the expectations of
certain functions of $X$ should be fixed:

\[ E[T_i(X)] = \alpha_i, \quad i = 1, \ldots, k. \]

If there are several probability densities $f$ under which $X$ would satisfy these expectation
constraints, the philosophical principle of entropy dictates that among these we should prefer the
model that maximises the entropy of $X$,

\[ H(f) = -\int_{\mathcal{X}} f(x) \log f(x)\, dx. \]

The entropy of $f$ is a measure of how “unpredictable” a random variable that follows $f$ is. If we
choose a density $f$ that has low entropy, then we are in essence imposing a more “predictable”
behaviour on X, a behaviour that is more favourable to us in terms of how easy it is to predict X. 
If we know nothing beyond our constraints, however, we do not wish to artificially impose such a 
simplification. We must therefore choose the worst case scenario, i.e. the most unpredictable model 
possible: the one that maximises the entropy. 

A very interesting result says the following: if a maximiser of the entropy subject to the $k$
expectation constraints exists, then it must be a $k$-parameter exponential family (in fact, the $T_i$
that appear in the expectation constraints will also appear in the formula for the density of the
specific exponential family). This explains why the exponential family features so prominently
in probability models, and why the members of the exponential family form the fundamental
examples used in much of statistics. □
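To make the principle concrete, one can compare the entropies of a few densities on $[0, \infty)$ that all satisfy the single constraint $E[X] = \mu$ (so $k = 1$ and $T_1(x) = x$). The sketch below is an illustration under our own assumptions: the value `mu = 2.0`, the three candidate densities, and the midpoint-rule integrator are not taken from the text. It is a known fact that the exponential density attains the largest entropy under this constraint, with $H = 1 + \log \mu$.

```python
import math

def entropy(density, upper, steps=100_000):
    """Midpoint-rule approximation of H(f) = -int_0^upper f(x) log f(x) dx."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        fx = density((k + 0.5) * h)
        if fx > 0.0:  # convention: 0 * log 0 = 0
            total -= fx * math.log(fx) * h
    return total

mu = 2.0  # hypothetical value of the constraint E[X] = mu

def exponential(x):
    """Exponential density with mean mu."""
    return math.exp(-x / mu) / mu

def uniform(x):
    """Uniform density on [0, 2 * mu], which also has mean mu."""
    return 1.0 / (2 * mu) if x <= 2 * mu else 0.0

def gamma2(x):
    """Gamma density with shape 2 and scale mu / 2, which also has mean mu."""
    scale = mu / 2
    return x * math.exp(-x / scale) / scale**2

for name, f in [("exponential", exponential), ("uniform", uniform), ("gamma(2)", gamma2)]:
    print(f"{name:12s} H = {entropy(f, upper=80.0):.4f}")
```

Among the three candidates, the exponential density (a one-parameter exponential family whose density involves $T_1(x) = x$) has the largest entropy, in line with the result quoted above.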


Example 1.38 (Parsimony) 


If we are given two different probability models $f(\cdot\,; \theta)$ and $g(\cdot\,; \psi)$, depending on multi-
dimensional parameters $\theta$ and $\psi$, respectively, both of which would satisfy equally well all the
constraints and conditions that the random phenomenon should satisfy, we choose the one that depends
on the smallest effective number of parameters. For example, if $\theta$ can range in some $d$-dimensional
set and $\psi$ can range in some $d'$-dimensional set, with $d' < d$, we choose $g$ over $f$. The
principle of parsimony rests upon the idea that given different models that are adequate for the
same phenomenon, we should choose the one that is least complex. □


Still, there may be situations where a probability model cannot be unequivocally 
selected by means of physical laws and/or scientific principles, or where we are 
simply not willing to make a choice solely on the basis of a principle. In this case, 
we may seek out empirical evidence in order to supplement our principled choice 
of model, or in order to validate a model. For example, we may have observed n 
independent realisations of the random variable $X$. By looking at the characteristics
of these n values we might be able to suggest a model that would appear fitting 
to the form of the data, or at minimum be able to rule out some models whose 
characteristics would be incompatible with what has been observed. The process 
of investigating patterns in the observed data in order to select an appropriate 
probability model is called exploratory data analysis. 


1.5.1 Exploratory Data Analysis 


Let $x_1, \ldots, x_n$ be a data set comprised of $n$ real values. These values constitute
the realisation of $n$ independent and identically distributed random variables
$X_1, \ldots, X_n$ whose probability distribution has a density/frequency function $f$
which is unknown to us. Worse still, we do not even know what class of distri-
butions $f$ belongs to. In order to be able to select an appropriate probability model,
exploratory data analysis considers various graphical representations and numerical
summaries of the data $x_1, \ldots, x_n$ that will allow us to gain an appreciation of the
general form of $f$, along with some basic characteristics, that will hopefully guide
our model choice.

What are some basic aspects of the form of a probability distribution that we can 
try to look for? Here are some of the most important characteristics that one ought 
to take into consideration: 

1. Location. The location of a distribution is generally understood to be a point on
the real line representing some centre of the distribution. The notion of a centre is
a vague concept that can be made precise in several different ways. For example,
it can be understood as a centre of mass (the mean, $\mu = E[X]$), as a global
maximum (the mode, $\arg\sup_{x \in \mathcal{X}} f(x)$), or a point that splits the probability mass
in half (the median, $m = \inf\{x : F(x) \ge 1/2\}$). Notice that a location may not
always be uniquely defined: though the mean is unique (when it exists), the mode
may not be (e.g. think of a distribution with two peaks of equal height).

2. Dispersion. The dispersion of a distribution is a measure of how concentrated
or diffuse the distribution is. Similarly to location, it can be formalised by
several different measures. Often one measures dispersion by quantifying how
concentrated the distribution is around a measure of location. For example, the
variance $E[(X - \mu)^2]$ is a classical measure of dispersion, that measures the
second moment of inertia of the distribution around the mean. It is not the only
one, though; for example, one may consider the mean absolute deviation (MAD),
$E[|X - \mu|]$, where $\mu = E[X]$. Or further yet, one may consider a measure of
dispersion that does not make explicit reference to the centre of a distribution.
For example, the interquartile range is defined as $\mathrm{IQR} = \inf\{x : F(x) \ge 3/4\} - \inf\{x : F(x) \ge 1/4\}$;
roughly speaking, it measures the length of the
most central interval supporting 50 % of the mass of the distribution.

3. Symmetry/Skewness. A density/frequency $f$ is symmetric about a point $x_0$ if
$f(x_0 - x) = f(x_0 + x)$ for all $x \in \mathcal{X}$. A distribution may be symmetric,
mildly asymmetric (mildly skew) or strongly asymmetric (strongly skew). One
may measure the asymmetry of a distribution through the notion of skewness,
which is defined as $E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$, where $\mu = E[X]$ and $\sigma = \sqrt{\mathrm{Var}[X]}$. If a
distribution is symmetric, then its skewness must be zero. When the skewness is
positive, we speak of a right-skew distribution (respectively, negative skewness
yields a left-skew distribution).

4. Tail Behaviour. The tails of a distribution are the values taken by its den-
sity/frequency $f(x)$ as $x \to \pm\infty$. Notice that since $f$ is always non-negative and
integrates/sums to 1, it must be that $\lim_{x \to \infty} P[|X| \ge x] = 0$. The rate of decay
of $P[|X| \ge x]$ as $x \to \infty$ is what determines its so-called tail behaviour. A light-
tailed distribution has a fast rate of decay (for example, exponential), and a heavy-
tailed distribution has a slow rate of decay (for example, polynomial). A heavy-
tailed distribution is such that the probability of observing an extreme value is
non-negligible. It might be that both the left and the right tails of a distribution
are heavy, but it might also be that only one of these two tails is heavy (Fig. 1.9).
If a candidate probability model for a random variable $X$ does not share similar
location/dispersion/symmetry/tail properties as those observed for $X$, then it is not
a good model for the phenomenon described by $X$. What do we mean by “those
observed for $X$”? We mean that we can use the sample values $x_1, \ldots, x_n$ in order
to gain some appreciation of these properties. We will do so quantitatively (using
numerical summaries) and qualitatively (using graphical summaries).

Fig. 1.9 Illustration of the notions of location, dispersion, skewness, and light/heavy tails. (a) Two
densities differing in location. (b) Two densities differing in dispersion. (c) Two densities differing
both in location and in dispersion. (d) Two asymmetric densities: one with positive skewness (red),
and one with negative skewness (blue). (e) A heavy tailed density (red) and a light tailed density
(blue). (f) Plots of the mapping $x \mapsto \int_x^{+\infty} f(y)\,dy$ for the two densities on the left (e)


1.5.1.1 Numerical Summaries 

We first introduce some useful notation: if $x_1, \ldots, x_n$ are $n$ real values, we denote by
$x_{(j)}$ the $j$th sample value, when these are ordered in increasing order (so $x_{(1)} = \min\{x_1, \ldots, x_n\}$
and $x_{(n)} = \max\{x_1, \ldots, x_n\}$). Notice that this means that

\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n-1)} \le x_{(n)}. \]

To illustrate the notation, say that $n = 4$ and we have $x_1 = 5$, $x_2 = 12$, $x_3 = 2$, and
$x_4 = 12$. Then we write $x_{(1)} = 2$, $x_{(2)} = 5$, and $x_{(3)} = x_{(4)} = 12$. So, in this case,
$x_{(1)} = x_3$, $x_{(2)} = x_1$, $x_{(3)} = x_{(4)} = x_2 = x_4$.

With this notation under our belt, we begin by defining two numerical summaries 
of the sample that can be used in order to gauge the location of the sample. 


Definition 1.39 (Sample Mean and Median) 
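Let $x_1, \ldots, x_n$ be a collection of real numbers, called a sample. We define:
1. The sample mean as

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]

2. The sample median as

\[ M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd,} \\[4pt] \frac{1}{2}\left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right) & \text{otherwise.} \end{cases} \]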


Both of these characteristics have merits and drawbacks as descriptors of
location. The mean takes into account the magnitude of each observation when
determining location, and can be seen as the barycentre of the sample values.²
However, it can be strongly affected by the presence of a single very large (or
very small) value, which might distort the representativeness of the mean as a good
descriptor of location. On the other hand, the median does not take into account
the precise value of the observations, but simply their ordering, and can be seen as
the “middle” observation³ (or the average of the two middle observations, when the
sample size is even). In this sense, it is a cruder indicator of location. This can also
be an advantage, though: the median will not be sensitive to the presence of very
large (or very small) observations, since it only takes their ordering (and not their
magnitude) into account.

² That is, if we took the line segment from $x_{(1)}$ to $x_{(n)}$ and placed equal weights at the points $x_1, \ldots, x_n$,
then the point $\bar{x}$ is where the line segment would balance.
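The contrast between the two descriptors of location is easy to see on a toy sample. The following sketch implements Definition 1.39 directly; the two small data sets are hypothetical and were chosen only so that the arithmetic is easy to follow.

```python
def sample_mean(xs):
    """Sample mean: (1/n) * sum of the observations."""
    return sum(xs) / len(xs)

def sample_median(xs):
    """Sample median: middle order statistic (n odd) or average of the two middle ones (n even)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # x_((n+1)/2) in 1-based notation
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of x_(n/2) and x_(n/2 + 1)

clean = [2.0, 3.0, 4.0, 5.0, 6.0]          # hypothetical sample
contaminated = [2.0, 3.0, 4.0, 5.0, 60.0]  # same sample with one very large value

for name, xs in [("clean", clean), ("contaminated", contaminated)]:
    print(f"{name:13s} mean = {sample_mean(xs):.2f}, median = {sample_median(xs):.2f}")
```

Replacing a single observation by a very large value moves the mean substantially, while the median is unchanged because the ordering of the middle observation is unaffected.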


Exercise 15 1. Calculate the mean $\bar{x}$ and the median $M$ of the following data set:


92 WS 9.7 110 8.5 
98 10.0 12.1 105 10.1 


2. Repeat your calculation when the observation 12.1 is replaced by 48.6. 
3. Compare the values of $\bar{x}$ and $M$ in part 1 and part 2. What do you observe?


Exercise 16 Show that 
1. The function $f(y) = \sum_{i=1}^{n} (x_i - y)^2$ has a unique minimum at $\bar{x}$.

2. The function $g(y) = \sum_{i=1}^{n} |x_i - y|$ is minimised at $M$. Warning: $g$ is not
differentiable at $y$ whenever $y = x_i$ for some $i = 1, \ldots, n$.


Next, we consider several numerical summaries that can be used in order to
ascertain how dispersed the underlying distribution might be on the basis of the
sample values $x_1, \ldots, x_n$.


Definition 1.40 (Sample Variance and MAD) 
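Let $x_1, \ldots, x_n$ be a collection of real numbers, called a sample. We define:
1. The sample variance as

\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

(the sample standard deviation is defined as $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$).
2. The sample MAD as

\[ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|. \]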

Exercise 17 Show that we may also write $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2$. Comment on why
this formula may be more useful.

³ In the sense that half the observations must be greater than or equal to the median, and half the
observations must be less than or equal to the median.

The sample variance expresses how concentrated or spread out the observations
are relative to their sample mean. From a physics point of view, it represents the
second moment of inertia around the mean.⁴ As was the case with the sample mean,
the sample variance can also be substantially inflated when there is a single extreme 
observation in the sample. This will create an impression of much higher dispersion, 
when in fact the sample may be fairly well concentrated, with the exception of a 
single rogue observation. The MAD, on the other hand, is somewhat less affected in 
such circumstances, since it is formed by summing absolute distances, rather than 
squared distances (the square would disproportionally inflate the contribution of an 
extreme observation to the sum). One can show that when there are no extreme 
observations, the variance is a better indicator of dispersion; in the presence of 
extreme observations, the MAD is preferred. How can we judge which observations 
are extreme? The pertinent notion is that of outliers, whose presence is in fact an 
indicator of heavy tails. 
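The different sensitivity of the two dispersion summaries to a rogue observation can be illustrated numerically. The sketch below uses the $1/n$ normalisation of Definition 1.40 and takes the sample MAD to be the average absolute deviation from the sample mean, which is our reading of the definition (by analogy with $E[|X - \mu|]$). The data are hypothetical.

```python
def sample_variance(xs):
    """Sample variance with the 1/n normalisation of Definition 1.40."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def sample_mad(xs):
    """Mean absolute deviation about the sample mean (assumed form of the sample MAD)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(abs(x - mean) for x in xs) / n

clean = [2.0, 3.0, 4.0, 5.0, 6.0]          # hypothetical sample
contaminated = [2.0, 3.0, 4.0, 5.0, 60.0]  # same sample with one extreme observation

for name, xs in [("clean", clean), ("contaminated", contaminated)]:
    print(f"{name:13s} variance = {sample_variance(xs):.2f}, MAD = {sample_mad(xs):.2f}")

# Inflation factors caused by the single extreme observation.
print("variance ratio:", sample_variance(contaminated) / sample_variance(clean))
print("MAD ratio:     ", sample_mad(contaminated) / sample_mad(clean))
```

Both summaries increase, but the variance is inflated by a much larger factor than the MAD, since the extreme deviation enters squared.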


Definition 1.41 (Quartiles, IQR and Outliers) 
Let $x_1, \ldots, x_n$ be a sample of $n$ real values, and let

\[ x_{(1)} \le \cdots \le M \le \cdots \le x_{(n)} \]

be the ordered sample, where $M$ is the median. We define:

1. The first quartile, $Q_1$, as the median of the ordered sub-sample $x_{(1)}, x_{(2)}, \ldots, M$.

2. The second quartile, $Q_2$, as being the median $M$, $Q_2 = M$.

3. The third quartile, $Q_3$, as the median of the ordered sub-sample $M, \ldots, x_{(n-1)}, x_{(n)}$.

4. The inter quartile range (IQR) as $\mathrm{IQR} = Q_3 - Q_1$.

5. An outlier as an observation falling outside the interval $[Q_1 - \tfrac{3}{2}\mathrm{IQR},\; Q_3 + \tfrac{3}{2}\mathrm{IQR}]$.


Just as the median can be interpreted as the “middle” observation, the first
quartile can be seen as the “first quarter” observation (and the third quartile can
be seen as the “third quarter” observation⁵). Half of the sample observations lie
within the interval $[Q_1, Q_3]$. In some sense, the interval $[Q_1, Q_3]$ is the most central
interval containing 50 % of the observations. The length of this interval, the IQR,
can also be used as an indicator of dispersion. This length reflects how spread out
the central portion of the sample is. Finally, the notions of quartiles and IQR can
be used in order to define what would qualify as an “extreme” observation (an
outlier). In some sense, extreme observations are removed from the bulk of the other
observations. The definition of an outlier may seem somewhat arbitrary, but there
are deeper mathematical reasons that support this definition.

⁴ That is, if we took the line segment from $x_{(1)}$ to $x_{(n)}$ and placed equal weights at the points $x_1, \ldots, x_n$,
then tried to rotate the segment around the point $\bar{x}$, then the variance is an indicator of how much
force we would need to apply. If the observations are spread far from $\bar{x}$, then we need a lot of force
(high sample variance); but if the observations are close to $\bar{x}$, then our task is easier (low sample
variance).

⁵ To be precise: 25 % of the sample observations are less than or equal to $Q_1$, and 25 % of the
observations are greater than or equal to $Q_3$.
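The quartile and outlier rules can be turned into a few lines of code. The sketch below follows one literal reading of Definition 1.41, in which the median $M$ is treated as a member of both sub-samples (when $n$ is even, $M$ is appended to each half as an extra value). This reading, the function names and the toy data are our assumptions, and other quartile conventions used by statistical software may give slightly different values.

```python
def median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(xs):
    """Return (Q1, M, Q3), with Q1 and Q3 the medians of x_(1),...,M and M,...,x_(n)."""
    ordered = sorted(xs)
    n = len(ordered)
    m = median_sorted(ordered)
    half = n // 2
    if n % 2 == 1:
        # M is the observation x_((n+1)/2); it belongs to both sub-samples.
        lower, upper = ordered[: half + 1], ordered[half:]
    else:
        # M lies between the two middle observations; add it to each half.
        lower, upper = ordered[:half] + [m], [m] + ordered[half:]
    return median_sorted(lower), m, median_sorted(upper)

def outliers(xs):
    """Observations outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _, q3 = quartiles(xs)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < low or x > high]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data with one extreme value
print("quartiles:", quartiles(sample))
print("outliers: ", outliers(sample))
```

On this toy sample only the value 30 falls outside the interval, so it is the only observation flagged as an outlier.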


Exercise 18 Let $x_1, \ldots, x_n$ be a sample. What are the median $M$ and quartiles $Q_1$
and $Q_3$ when $n = 12, 13, 14$ or $15$? A more tedious generalisation: find the general
formulae (for $n$ arbitrary) for the first and third quartile, $Q_1$ and $Q_3$. Hint: these
formulae are of the form

\[ \begin{cases} ? & n \equiv 0 \mod 4 \\ ? & n \equiv 1 \mod 4 \\ ? & n \equiv 2 \mod 4 \\ ? & n \equiv 3 \mod 4. \end{cases} \]


We conclude our brief discussion of numerical summaries by considering a 
measure of asymmetry: the sample skewness. 


Definition 1.42 (Sample Skewness) 
Let $x_1, \ldots, x_n$ be a sample of $n$ real values. We define the skewness of this sample
as

\[ \mathrm{SK} = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{3/2}}. \]

If both the numerator and denominator are equal to zero (which can occur in
discrete samples), then SK is undefined.


As was earlier discussed, one can look at whether SK is positive, negative, or
close to zero, in order to judge whether the distribution generating the sample had
a right or left asymmetry, or was indeed symmetric. A drawback is that the sample
skewness may not be a good proxy for the true skewness of the distribution, and 
defining good bounds on “how large” the skewness should be in order to declare 
that the distribution is asymmetric is a subtle problem that requires methods from 
later chapters. Instead of embarking on such a project at this point, we turn to the 
use of graphical summaries, which will allow us to obtain an intuitive appreciation 
of the asymmetry in the data, without needing to resort to elaborate calculations. 
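For completeness, here is a minimal sketch of the sample skewness, written as the third central sample moment divided by the $3/2$ power of the second central sample moment (our reading of Definition 1.42). The function returns `None` in the undefined $0/0$ case, and the three small samples are hypothetical.

```python
def sample_skewness(xs):
    """Third central sample moment divided by the second central sample moment to the power 3/2."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    if m2 == 0:
        return None  # all observations coincide: SK is undefined (0/0)
    return m3 / m2 ** 1.5

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical symmetric sample
right_skew = [1.0, 1.0, 1.0, 2.0, 10.0]   # long right tail
left_skew = [-x for x in right_skew]      # mirror image: long left tail

for name, xs in [("symmetric", symmetric), ("right skew", right_skew), ("left skew", left_skew)]:
    print(f"{name:10s} SK = {sample_skewness(xs):+.3f}")
```

The sign of SK matches the direction of the longer tail, and mirroring a sample about zero flips the sign.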


1.5.1.2 Graphical Summaries 

We now turn to two simple graphical representations of the sample $x_1, \ldots, x_n$
that can help us visualise the form of the underlying density/frequency $f$: the
histogram and the boxplot. A histogram is a proxy for the unknown density built
out of the observed sample values $x_1, \ldots, x_n$. The idea is simple: if there are many
observations falling in some interval $I$, then the density should be relatively high on
that interval. Therefore, if we partition the x-axis into disjoint intervals, and define
a step function that is constant over these intervals (and such that the height of each
step is proportional to the percentage of observations lying in the corresponding
interval), we will have constructed a step function approximation to the unknown
density.


Definition 1.43 (Histogram) 
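Let $x_1, \ldots, x_n$ be a collection of $n$ real values and $h > 0$ be a constant. Let
$\{I_j\}_{j \in \mathbb{Z}}$ be a regular partition of $\mathbb{R}$ comprised of intervals of length $h > 0$,

\[ I_j = [\kappa + (j-1)h,\; \kappa + jh), \quad j \in \mathbb{Z}, \]

where $\kappa \in \mathbb{R}$ is some fixed real number. The histogram of $x_1, \ldots, x_n$ with bin
width $h > 0$ and origin $\kappa$ is defined to be the graph of the function:

\[ y \mapsto \mathrm{hist}_{x_1,\ldots,x_n}(y) = \frac{1}{nh} \sum_{j \in \mathbb{Z}} \mathbf{1}\{y \in I_j\} \sum_{i=1}^{n} \mathbf{1}\{x_i \in I_j\}. \]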


Notice that the histogram is indeed a reasonable step function approximation
of $f$: by its definition, the function $\mathrm{hist}_{x_1,\ldots,x_n}(y)$ takes non-negative values only,
and the integral of the function $\mathrm{hist}_{x_1,\ldots,x_n}(y)$ is equal to 1. In addition, the integral
of $\mathrm{hist}_{x_1,\ldots,x_n}(y)$ over an interval $I_j$ gives us the proportion of sample values that
fell inside $I_j$. It therefore has the properties of a probability density function.
Furthermore,

\[ \int_{I_j} \mathrm{hist}_{x_1,\ldots,x_n}(y)\,dy = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \in I_j\} \approx P[X \in I_j] = \int_{I_j} f(y)\,dy. \]

In this sense, the histogram is some sort of Riemann-sum-proxy of the density
$f$, constructed using the values of the sample. It can be used in order to gauge
properties such as location, dispersion, symmetry and tail behaviour via a visual
inspection.
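The histogram function of Definition 1.43 can be sketched in a few lines. An observation $x$ lies in $I_j = [\kappa + (j-1)h, \kappa + jh)$ precisely when $j = \lfloor (x - \kappa)/h \rfloor + 1$, which is what the code below uses. The function names and the toy sample are our own and serve only as an illustration.

```python
import math

def histogram(xs, h, kappa=0.0):
    """Return (hist, counts): the step function y -> hist(y) and the bin counts.

    Bin j is I_j = [kappa + (j - 1) * h, kappa + j * h).
    """
    n = len(xs)
    counts = {}
    for x in xs:
        j = math.floor((x - kappa) / h) + 1  # index of the bin containing x
        counts[j] = counts.get(j, 0) + 1

    def hist(y):
        j = math.floor((y - kappa) / h) + 1  # index of the bin containing y
        return counts.get(j, 0) / (n * h)

    return hist, counts

sample = [0.2, 0.7, 1.1, 1.3, 1.9, 3.4]  # hypothetical data
h = 1.0
hist, counts = histogram(sample, h=h, kappa=0.0)

print("bin counts:", counts)
print("hist(1.5) =", hist(1.5))
# The step function integrates to 1: each bin contributes height * width.
print("integral  =", sum(c / (len(sample) * h) * h for c in counts.values()))
```

The height over each bin is the proportion of observations in that bin divided by $h$, so the integral over a bin recovers the proportion itself and the total integral is 1.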


Fig. 1.10 Histograms for different samples (and, correspondingly, different bin widths) compared
with the density from which the samples were drawn. (a) Density of $N(0, 1)$ (in red) and histogram
for a random sample of size 20 from an $N(0, 1)$ (in black). (b) Density of $N(0, 1)$ (in red) and
histogram for a random sample of size 100 from an $N(0, 1)$ (in black). (c) Density of $\chi^2_3$ (in red)
and histogram for a random sample of size 20 from a $\chi^2_3$ (in black). (d) Density of $\chi^2_3$ (in red) and
histogram for a random sample of size 100 from a $\chi^2_3$ (in black). (e) Density of $\chi^2_3$ (in red) and
histogram for a random sample of size 20 from a $\chi^2_3$ (in black) when the bin width $h$ is taken to be
very small. (f) Density of $\chi^2_3$ (in red) and histogram for a random sample of size 20 from a $\chi^2_3$ (in
black) when the bin width $h$ is taken to be very large

Remark 1.44 (Bin Width) Depending on the choice of $h$, a histogram may be
more or less informative about the structure of the sample at hand. Consider the two
extremes, $h \to 0$ and $h \to \infty$. In the first case, the intervals eventually become
so short that any interval contains either no observations or a single observation,
thus simply highlighting where each observation lies on the x-axis (see Fig. 1.10e,
p. 36). In the second case, all the observations are eventually contained in a single
huge interval, and the histogram simply informs us that there is a large region that
contains all observations (see Fig. 1.10f, p. 36). Reasonable values of $h$ allow us to
visualise the structure of the sample. In principle, the value of $h$ should depend on
the sample size $n$: the larger $n$, the smaller $h$ needs to be; intuitively, this means
that when we have more observations, we can try to investigate finer aspects of the
structure of the sample $x_1, \ldots, x_n$. The precise requirement is that we must have
that $h \to 0$ and $nh \to \infty$ as $n \to \infty$. There is a lot of theory on what the optimal $h$ is as
dependent on $n$, but we will not consider this here. A simple (but often suboptimal)
choice is to take $h = n^{-1/2}$. A data-dependent choice is the so-called Freedman–Diaconis
choice of $h = 2\,\mathrm{IQR} \times n^{-1/3}$.
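The two bin-width rules of Remark 1.44 can be tabulated to see that they satisfy both requirements, $h \to 0$ and $nh \to \infty$. In the sketch below the simple rule is read as $h = n^{-1/2}$ and the Freedman–Diaconis rule as $h = 2\,\mathrm{IQR}\, n^{-1/3}$. The value `iqr = 1.35` is a hypothetical interquartile range, and the function names are ours.

```python
def simple_bin_width(n):
    """Simple rule: h = n^(-1/2)."""
    return n ** -0.5

def freedman_diaconis_bin_width(n, iqr):
    """Freedman-Diaconis rule: h = 2 * IQR * n^(-1/3)."""
    return 2.0 * iqr * n ** (-1.0 / 3.0)

iqr = 1.35  # hypothetical interquartile range of the sample
for n in [10, 100, 1_000, 10_000]:
    h_simple = simple_bin_width(n)
    h_fd = freedman_diaconis_bin_width(n, iqr)
    print(f"n={n:6d}  simple: h={h_simple:.4f}, n*h={n * h_simple:9.1f}   "
          f"FD: h={h_fd:.4f}, n*h={n * h_fd:9.1f}")
```

In both columns $h$ shrinks while $nh$ grows as $n$ increases, so the bins become finer and yet each bin still collects more and more observations on average.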


Remark 1.45 (Bin Centres) Notice that for any given $h > 0$, there are
several possible histograms depending on the choice of $\kappa$. Unfortunately, there is
no unequivocal means of determining what the “right” $\kappa$ is. The analyst must either
try several values or at minimum keep in mind that the histogram should not be
over-interpreted, as its form may be perturbed by changes in $\kappa$ (e.g. because by a
small shift in $\kappa$ some observations that fell in the $k$th interval may now fall in the
$(k + 1)$th interval, and so on).


Histograms can be criticised for having some noticeable drawbacks. Chief among
these is the need to choose a bin width $h$ and an origin $\kappa$. Another drawback is that
they can sometimes become misleading if over-interpreted. For example, looking 
at the histogram in Fig. 1.10c (p. 36) we see a slight pattern of asymmetry. Is this 
to be taken as an indication that the underlying distribution is asymmetric? Not 
necessarily, since a histogram can rarely be perfectly symmetric due to sampling 
variation. The message here is that we should not try to extract finer information 
than what our graphical summary is really able to offer. Histograms may deceivingly 
appear to be interpretable in more detail than they actually are. 

A different type of graphical display that allows us to probe the location, scale, 
asymmetry and tails of a density is the boxplot. In contrast to the histogram, the 
boxplot is a much coarser description of the sample structure and does not require 
the specification of any tuning parameters. It simply marks out the points on the x- 
axis where some key numerical summaries of a sample are located. This is usually 
done in the form of a box, which explains the name of the boxplot: 


Definition 1.46 (Boxplot) 
Let $x_1, \ldots, x_n$ be a collection of $n$ real values. Let:

1. $M$ be the median, $Q_1$ be the first quartile, and $Q_3$ be the third quartile of
$\{x_1, \ldots, x_n\}$.

2. $W_1 = \min_{1 \le j \le n}\{x_j : x_j \ge Q_1 - 1.5 \times \mathrm{IQR}\}$ and $W_2 = \max_{1 \le j \le n}\{x_j : x_j \le Q_3 + 1.5 \times \mathrm{IQR}\}$.

3. $O = \{i \in \{1, \ldots, n\} : x_i \notin [W_1, W_2]\}$.

The boxplot of $x_1, \ldots, x_n$ is an annotation of the values $M$, $Q_1$, $Q_3$, $W_1$, $W_2$, and
$\{x_j : j \in O\}$ on the real line. The following is a standard annotation:

The definition is a little difficult to visualise, but the picture says it all: we
annotate the median ($M$), the first and third quartiles ($Q_1$ and $Q_3$), and the first
and last observation ($W_1$ and $W_2$) to fall within the interval $[Q_1 - 1.5 \times \mathrm{IQR},\; Q_3 + 1.5 \times \mathrm{IQR}]$
(these two observations are called the whiskers). Any observations falling
outside of the whiskers are marked separately and are outliers (the $\{x_j : j \in O\}$).
Since $W_1 \le Q_1 \le M \le Q_3 \le W_2$, we usually omit the explicit annotation, since
by their ordering it is clear which component of the boxplot denotes which value.

The boxplot illustrates the location of the sample by means of the median. It 
also gives an indication of the underlying dispersion by presenting the quartiles $Q_1$
and $Q_3$ (and their distance) as well as the whiskers ($W_1$ and $W_2$). Large distances
between these values indicate large dispersion. Asymmetries can be probed by 
looking at the positioning of the quartiles and of the whiskers relative to the median. 
If these are located roughly symmetrically opposite of the median on either side, 
then we have a roughly symmetric structure. If the distance of one of the quartiles 
or one of the whiskers from the median is greater than that of the other, then we have 
skewness towards the side where the distance is greater. Finally a boxplot allows us 
to detect the presence of heavy tails, by looking at how many outliers there are, and 
on which tail of the distribution these are. Again, it is easiest to appreciate different 
forms of boxplots by looking at some pictures (see Fig. 1.11, p. 39). 
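The whiskers and the outlier set of Definition 1.46 are simple to compute once the quartiles are known. In the sketch below the quartiles are passed in as arguments, so that any quartile convention can be used. The function name, the toy sample and the quartile values supplied to it are hypothetical.

```python
def whiskers_and_outliers(xs, q1, q3):
    """Return (W1, W2, outliers) for the sample xs, given its quartiles q1 and q3.

    W1 is the smallest observation that is >= q1 - 1.5 * IQR, and
    W2 is the largest observation that is <= q3 + 1.5 * IQR.
    """
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low <= x <= high]
    w1, w2 = min(inside), max(inside)
    outside = [x for x in xs if x < w1 or x > w2]
    return w1, w2, outside

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data
w1, w2, out = whiskers_and_outliers(sample, q1=3.5, q3=7.5)
print("whiskers:", (w1, w2))
print("outliers:", out)
```

Note that the whiskers are actual observations, not the endpoints of the interval $[Q_1 - 1.5 \times \mathrm{IQR},\; Q_3 + 1.5 \times \mathrm{IQR}]$, so the plotted whisker lengths can be unequal even when the interval is symmetric around the box.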


Exercise 19 The following data are on the maximal weight (in tons) that could be 
supported by steel cables produced at a factory: 


10.1 122 93 124 13.7 11 13.3 
10.8 116 10.12 11.2 114 #118 #71 
12.2 126 92 142 105 


1. Represent the data in a histogram with bin width $h = 1$ and origin $\kappa = 10$.
Construct a second histogram, this time with $h = 2$ and $\kappa = 11$ and compare the
two.

2. What is the approximate weight that at least 3/4 of the cables can support? 

3. Find the third quartile. 

4. Construct a box plot. Are there any outliers to be noticed? Where does one find 
the value determined in part (2) in this diagram? 


Fig. 1.11 Three boxplots corresponding to three different samples. In each case, the ticks on the
axis below the boxplot represent the actual sample values from which the boxplot was constructed.
Some noticeable aspects of the three samples based on the boxplots are: the first sample seems to
present a high degree of symmetry. Both of the remaining samples show a clear asymmetry, and
they are both skewed to the right (positive skewness). The third sample appears to present heavy
right tails, as indicated by the presence of multiple outliers


Exercise 20 The following table contains the results of rugby matches of the
eleventh and twelfth match days (November 2014) of the French rugby first (“Top
14”) and second (“Pro D2”) division. The home team is always mentioned first.

Top 14                              Pro D2
Montpellier–Brive           10-25   Albi–Agen                     22-9
Castres–Toulon              22-14   Béziers–Aurillac              14-19
Clermont–Stade Français     51-9    Colomiers–Pau                 50-10
Grenoble–Lyon               34-30   Montauban–Tarbes              31-13
Oyonnax–La Rochelle         37-9    Biarritz–Massy                21-3
Racing Métro–Bayonne        27-10   Dax–Narbonne                  12-3
Bordeaux Bègles–Toulouse    20-21   Perpignan–Bourgoin            42-0
                                    Carcassonne–Mont-de-Marsan    17-28

Toulon–Clermont             27-19   Biarritz–Agen                 42-18
Castres–Racing Métro        9-14    Albi–Carcassonne              34-22
La Rochelle–Bayonne         19-19   Aurillac–Colomiers            20-13
Lyon–Montpellier            23-20   Bourgoin–Montauban            14-20
Oyonnax–Bordeaux Bègles     28-23   Massy–Dax                     50-13
Toulouse–Grenoble           22-25   Mont-de-Marsan–Béziers        32-18
Stade Français–Brive        20-17   Narbonne–Tarbes               36-23
                                    Pau–Perpignan                 22-19

1. We wish to compare the performance of the teams in the first and second division.
To this aim, calculate the pertinent statistics (mean, median, quartiles, IQR, etc.)
for the score difference, as well as for the total points scored in each match, for
each of the two divisions.

2. Construct box plots, and juxtapose them, for the sum and difference of points in
each division, respectively. What conclusions can we draw?


Sampling from Probability Distributions 


As mentioned in the introduction, statistical inference deals with the problem of
making inferences from data in the presence of uncertainty. The mathematical
framework for this endeavour is provided by probability models. At a general level,
an inferential task can be cast as:

1. A random phenomenon $X$ is assumed to be described by a regular parametric
probability model $\{F_\theta : \theta \in \Theta\}$. The functional form of each $F_\theta$ is completely
known, for any value of the parameter $\theta \in \Theta \subseteq \mathbb{R}^p$.

2. We observe a sample from a specific version of this probability model. That is, we
observe $n$ independent and identically distributed realisations $X_1, \ldots, X_n$ having
distribution $F(x; \theta)$, for some $\theta \in \Theta$. Though we know that our observations
stem from a version of the parametric regular model, we do not know the precise
$\theta$ that generated the data (i.e. we know the model, but we do not know which
member of the model generated the data).

3. We wish to use the sample $(X_1, \ldots, X_n)$ at hand in order to make statements
about the true value of $\theta$ that generated it, and quantify the uncertainty attached
to those statements.


2.1 Sampling, Statistics and Sufficiency 


Since the sample is all we have, anything we do will essentially be a function of the
sample, say $T(X_1, \ldots, X_n)$. Such a function is called a statistic.


Definition 2.1 (Statistic)
Let $\mathcal{X}$ be a sample space. Given $n \ge 1$, a statistic is a function $T : \mathcal{X}^n \to \mathbb{R}$.


Notice that the function $T$ cannot depend on the parameter $\theta$, since we do not
know the latter. If the function $T$ also depends on $\theta$, it cannot be called a statistic.

Since a statistic $T : \mathcal{X}^n \to \mathbb{R}$ reduces a collection of $n$ numbers to a single
number, it cannot be injective. As a result $T(X_1, \ldots, X_n)$ will in general provide less
information about $\theta$ than the complete data $(X_1, \ldots, X_n)$ will. For some models,
however, we are able to choose a statistic $T$ such that $T(X_1, \ldots, X_n)$ is equally
informative about $\theta$ as $(X_1, \ldots, X_n)$ is. Such a statistic is called a sufficient statistic
(because it suffices to use $T(X_1, \ldots, X_n)$ in lieu of $(X_1, \ldots, X_n)$).


Definition 2.2 (Sufficiency) 

Let $X_1, \ldots, X_n \overset{iid}{\sim} f_\theta$. A statistic $T : \mathcal{X}^n \to \mathbb{R}$ is called sufficient for the
parameter $\theta$, if $P_\theta[X_1 \le x_1, \ldots, X_n \le x_n \mid T = t]$ does not depend on $\theta$, for all
$(x_1, \ldots, x_n) \in \mathbb{R}^n$ and all $t \in \mathbb{R}$.

The intuitive interpretation of this definition is: given the value of $T(X_1, \ldots, X_n)$,
the conditional distribution of $(X_1, \ldots, X_n)$ no longer depends on $\theta$. Therefore,
knowing $(X_1, \ldots, X_n)$ in addition to knowing $T(X_1, \ldots, X_n)$ cannot furnish any
more or any less information about which $\theta$ generated the data. The definition
is usually hard to verify, but the following equivalent condition is much easier to
verify:


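Theorem 2.3 (Fisher-Neyman Factorisation) Suppose that $(X_1,\dots,X_n)$ has
a joint density/frequency function $f_{X_1,\dots,X_n}(x_1,\dots,x_n;\theta)$, $\theta \in \Theta$. A statistic
$T : \mathcal{X}^n \to \mathbb{R}$ is sufficient for $\theta$ if and only if there exist $g : \mathbb{R} \times \Theta \to \mathbb{R}$ and
$h : \mathcal{X}^n \to \mathbb{R}$ such that

\[
f_{X_1,\dots,X_n}(x_1,\dots,x_n;\theta) = g(T(x_1,\dots,x_n),\theta)\,h(x_1,\dots,x_n).
\]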


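Proof The proof in the continuous case requires the use of measure theory.
Therefore, we will only give the proof in the case where the $X_i$ are discrete random
variables. Notice that if the $X_i$ are discrete, then $T(X_1,\dots,X_n)$ must also be
discrete. Suppose that $T$ is sufficient. Then,

\[
\begin{aligned}
f_{X_1,\dots,X_n}(x_1,\dots,x_n;\theta) &= \mathbb{P}_\theta[X_1 = x_1,\dots,X_n = x_n] \\
&= \mathbb{P}_\theta[X_1 = x_1,\dots,X_n = x_n,\ T = T(x_1,\dots,x_n)] \\
&\quad + \underbrace{\mathbb{P}_\theta[X_1 = x_1,\dots,X_n = x_n,\ T \neq T(x_1,\dots,x_n)]}_{=0} \\
&= \mathbb{P}_\theta[T = T(x_1,\dots,x_n)]\;\mathbb{P}_\theta[X_1 = x_1,\dots,X_n = x_n \mid T = T(x_1,\dots,x_n)].
\end{aligned}
\]

Since $T$ is sufficient, the second factor does not depend on $\theta$, and so the Fisher-Neyman
factorisation follows (take $g$ to be the first factor and $h$ the second). To prove the
converse, suppose that $f_{X_1,\dots,X_n}(x_1,\dots,x_n;\theta) = g(T(x_1,\dots,x_n),\theta)\,h(x_1,\dots,x_n)$. Then,

\[
\begin{aligned}
\mathbb{P}_\theta[X_1 = x_1,\dots,X_n = x_n \mid T = t]
&= \frac{\mathbb{P}_\theta[X_1 = x_1,\dots,X_n = x_n,\ T = t]}{\mathbb{P}_\theta[T = t]} \\
&= \frac{\mathbb{P}_\theta[X_1 = x_1,\dots,X_n = x_n]\,\mathbf{1}\{T(x_1,\dots,x_n) = t\}}
        {\sum_{y_1\in\mathcal{X}}\cdots\sum_{y_n\in\mathcal{X}} \mathbb{P}_\theta[X_1 = y_1,\dots,X_n = y_n]\,\mathbf{1}\{T(y_1,\dots,y_n) = t\}} \\
&= \frac{g(T(x_1,\dots,x_n),\theta)\,h(x_1,\dots,x_n)\,\mathbf{1}\{T(x_1,\dots,x_n) = t\}}
        {\sum_{y_1\in\mathcal{X}}\cdots\sum_{y_n\in\mathcal{X}} g(T(y_1,\dots,y_n),\theta)\,h(y_1,\dots,y_n)\,\mathbf{1}\{T(y_1,\dots,y_n) = t\}} \\
&= \frac{g(t,\theta)\,h(x_1,\dots,x_n)\,\mathbf{1}\{T(x_1,\dots,x_n) = t\}}
        {g(t,\theta)\sum_{y_1\in\mathcal{X}}\cdots\sum_{y_n\in\mathcal{X}} h(y_1,\dots,y_n)\,\mathbf{1}\{T(y_1,\dots,y_n) = t\}} \\
&= \frac{h(x_1,\dots,x_n)\,\mathbf{1}\{T(x_1,\dots,x_n) = t\}}
        {\sum_{y_1\in\mathcal{X}}\cdots\sum_{y_n\in\mathcal{X}} h(y_1,\dots,y_n)\,\mathbf{1}\{T(y_1,\dots,y_n) = t\}},
\end{aligned}
\]

and the latter does not depend on $\theta$ because neither $h$ (by its definition) nor $T$ (being
a statistic) depends on $\theta$. □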


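Example 2.4 (Estimating the Bias of a Coin)

Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Bern}(p)$, with $X_i = 1$ when the $i$th toss comes up heads. For
$x_1,\dots,x_n \in \{0,1\}$, the joint frequency function is

\[
f_{X_1,\dots,X_n}(x_1,\dots,x_n;p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}.
\]

Therefore, the Fisher-Neyman factorisation is satisfied with $T(X_1,\dots,X_n) = \sum_{i=1}^n \mathbf{1}\{X_i = 1\}
= \sum_{i=1}^n X_i$ (the last equality is because each $X_i$ is 0 or 1), $g(t,p) = p^t(1-p)^{n-t}$ and
$h(x_1,\dots,x_n) = 1$. It follows that $\sum_{i=1}^n X_i$ is sufficient for $p$. Intuitively: knowing the total
number of heads is all that matters as far as learning about $p$. Knowing the precise order in which
these heads came up is irrelevant as far as $p$ is concerned. □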
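
The intuition of Example 2.4 can be checked by brute force. The following is a minimal sketch in Python (our choice of language for illustration; the function name `conditional_given_total` and the particular sequence `x` are hypothetical and not part of the text). It computes the conditional probability of one fixed sequence of tosses given the total number of heads, for several values of $p$. Sufficiency predicts that the answer is $1/\binom{n}{t}$ whatever the value of $p$.

```python
from itertools import product
from math import comb, isclose


def conditional_given_total(x, p):
    """P[X = x | sum(X) = sum(x)] for iid Bernoulli(p) coordinates."""
    n, t = len(x), sum(x)

    def joint(y):
        # Joint frequency function p^{sum y} (1 - p)^{n - sum y}.
        return p ** sum(y) * (1 - p) ** (n - sum(y))

    # P[T = t]: add the joint frequencies of all 0/1 sequences with total t.
    p_total = sum(joint(y) for y in product((0, 1), repeat=n) if sum(y) == t)
    return joint(x) / p_total


x = (1, 0, 1, 1, 0)  # one hypothetical outcome with t = 3 heads in n = 5 tosses
values = [conditional_given_total(x, p) for p in (0.1, 0.5, 0.9)]
reference = 1 / comb(len(x), sum(x))  # uniform over arrangements of the heads
print(values, reference)
```

All entries of `values` agree with `reference` up to rounding: once the number of heads is known, every ordering is equally likely, regardless of $p$.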


When applied to a sample, any statistic (whether sufficient or not) becomes itself 
a random variable, one that has a distribution of its own. This is called a sampling 
distribution, because it arises as the result of random sampling.


Definition 2.5 (Sampling Distribution)
Let $X_1,\dots,X_n \overset{iid}{\sim} F$ and $T : \mathcal{X}^n \to \mathbb{R}$ be a statistic. The sampling distribution
of $T$ under the distribution $F$ is the probability distribution

\[
F_T(t) = \mathbb{P}[T(X_1,\dots,X_n) \le t], \qquad t \in \mathbb{R}.
\]


Remark 2.6 (Notation) We always consider statistics as applied to a sample,
and so we will very often suppress the dependence of the statistic on $X_1,\dots,X_n$, and
write simply $T$ instead of $T(X_1,\dots,X_n)$. In this notation, the sampling distribution
of $T$ under $F$ is $F_T(t) = \mathbb{P}[T \le t]$.


Exercise 21 Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Unif}(0,\theta)$. Show that $T(X_1,\dots,X_n) = X_{(n)}$ (the
sample maximum) is a sufficient statistic for $\theta$, and find its sampling distribution.


Exercise 22 Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Pois}(\lambda)$. Show that $T(X_1,\dots,X_n) = \sum_{i=1}^n X_i$
is a sufficient statistic for $\lambda$, and find its sampling distribution.


Note that in the definition of the sampling distribution of $T$ we specified under
which distribution it occurs. This needs to be done, since changing the distribution
of $X_1,\dots,X_n$ to some $G$ instead of $F$ will also change the sampling distribution
of $T$. In this chapter we will investigate precisely the dependence of this sampling
distribution on the form of $T$ and the form of $F$. Specifically:

• We will investigate some special forms of $T$ and some special cases of $F$ where
  the sampling distribution is known exactly.
• In more general situations, when the form of $T$ and $F$ might not allow for
  a straightforward determination of the sampling distribution, we will try to
  give ways of establishing an approximate distribution (and the mathematical
  framework required to make sense of “approximate distribution”).

The statistics $T$ that we will focus on will be sufficient statistics, and the models
$F$ will be members of exponential families.


2.2 Sampling from a Normal Distribution


We begin with the simplest possible problem: establishing the sampling distribution
of the statistics

\[
\bar X = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2,
\]

when the sample $X_1,\dots,X_n$ is a random sample from the normal distribution, i.e.
$X_1,\dots,X_n \overset{iid}{\sim} N(\mu,\sigma^2)$. Note that $\bar X$ is simply the empirical mean, while $S^2$
is $n/(n-1)$ times the empirical variance (the reason for using $S^2$ instead of the
empirical variance will be seen very shortly). Though this problem seems relatively
elementary, we will see that, for many other distributions, and for many other types
of statistics, we can reduce the problem of determining the sampling distribution
of those statistics to (approximately) a problem involving empirical means and
variances of (approximately) normal random variables. We summarise the sampling
distribution of $\bar X$ and $S^2$ in the next proposition.


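Proposition 2.7 (Gaussian Sampling) Let $X_1,\dots,X_n \overset{iid}{\sim} N(\mu,\sigma^2)$. Then,
1. The joint distribution of $X_1,\dots,X_n$ has probability density function

\[
f_{X_1,\dots,X_n}(x_1,\dots,x_n) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2}\sum_{i=1}^n \left(\frac{x_i-\mu}{\sigma}\right)^2\right\}.
\]

2. The sample mean satisfies $\bar X \sim N(\mu,\sigma^2/n)$.
3. The random variables $\bar X$ and $S^2$ are independent.
4. The random variable $S^2$ satisfies $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$.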


Proof For part (1), it suffices by independence to take the product of the marginal
$N(\mu,\sigma^2)$ densities in order to arrive at the expression for the joint density.

For part (2), the fact that the random variables $X_1,\dots,X_n$ are independent
normal variables implies that $\sum_{i=1}^n X_i$ is also a normal random variable, with mean
$n\mu$ and variance $n\sigma^2$ (by Corollary 1.35, p. 25). It follows that $\bar X = n^{-1}\sum_{i=1}^n X_i \sim
N(\mu,\sigma^2/n)$.

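For part (3), we note that if we can prove the independence of $\bar X$ from $(X_1 - \bar X,
\dots, X_n - \bar X)$ then it will immediately follow that $\bar X$ and $S^2$ are independent. To
show this, write

\[
Y_1 = \bar X \quad \& \quad Y_k = X_k - \bar X, \quad k = 2,\dots,n.
\]

Notice that the transformation $(X_1,\dots,X_n) \mapsto (Y_1,\dots,Y_n)$ is a linear bijection
$\mathbb{R}^n \to \mathbb{R}^n$ because

\[
\begin{array}{ll}
Y_1 = \bar X            & \qquad X_1 = Y_1 - \sum_{i=2}^n Y_i \\
Y_2 = X_2 - \bar X      & \qquad X_2 = Y_2 + Y_1 \\
Y_3 = X_3 - \bar X      & \qquad X_3 = Y_3 + Y_1 \\
\quad\vdots             & \qquad \quad\vdots \\
Y_n = X_n - \bar X      & \qquad X_n = Y_n + Y_1
\end{array}
\]

Since the transformation is linear, its Jacobian is a constant that does not depend on
$(X_1,\dots,X_n)$ (its determinant is in fact equal to $1/n$, so that of the inverse transformation
equals $n$). It follows from our results on transformations
of random variables (Theorem 1.33, p. 23) that the joint density of $(Y_1,\dots,Y_n)$ is
given by

\[
\begin{aligned}
f_{Y_1,\dots,Y_n}(y_1,\dots,y_n) &= n\, f_{X_1,\dots,X_n}(x_1,\dots,x_n) \\
&= \frac{n}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right\} \\
&= \frac{n}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \bar x + \bar x - \mu)^2\right\},
\end{aligned}
\]

where $(x_1,\dots,x_n)$ is the image of $(y_1,\dots,y_n)$ under the inverse transformation, so that
$\bar x = y_1$ and $x_i - \bar x = y_i$ for $i \ge 2$.

But $\sum_{i=1}^n (x_i - \bar x) = \sum_{i=1}^n x_i - n\bar x = n\bar x - n\bar x = 0$ so, on the one hand, $\sum_{i=1}^n (x_i -
\bar x)(\bar x - \mu) = 0$ and, on the other hand, $(x_1 - \bar x) = -\sum_{i=2}^n (x_i - \bar x)$. Using these
two identities gives us:

\[
\begin{aligned}
f_{Y_1,\dots,Y_n}(y_1,\dots,y_n)
&= \frac{n}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^n (x_i-\bar x)^2 + n(\bar x-\mu)^2\right]\right\} \\
&= \frac{n}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\left[(x_1-\bar x)^2 + \sum_{i=2}^n (x_i-\bar x)^2 + n(\bar x-\mu)^2\right]\right\} \\
&= \frac{n}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\left[\Big(\sum_{i=2}^n (x_i-\bar x)\Big)^2 + \sum_{i=2}^n (x_i-\bar x)^2 + n(\bar x-\mu)^2\right]\right\} \\
&= \frac{n}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\left[\Big(\sum_{i=2}^n y_i\Big)^2 + \sum_{i=2}^n y_i^2 + n(y_1-\mu)^2\right]\right\} \\
&= \underbrace{\frac{\sqrt{n}}{(2\pi\sigma^2)^{(n-1)/2}} \exp\left\{-\frac{1}{2\sigma^2}\left[\Big(\sum_{i=2}^n y_i\Big)^2 + \sum_{i=2}^n y_i^2\right]\right\}}_{f_1(y_2,\dots,y_n)}
\times \underbrace{\frac{1}{(2\pi\sigma^2/n)^{1/2}} \exp\left\{-\frac{(y_1-\mu)^2}{2\sigma^2/n}\right\}}_{f_2(y_1)}.
\end{aligned}
\]

Notice that $f_2(y_1)$ is the marginal density of $Y_1 = \bar X \sim N(\mu,\sigma^2/n)$, as proven
in part (2). Therefore, if we integrate both sides with respect to $y_1$, we obtain that
$f_1(y_2,\dots,y_n)$ is the joint density of $(Y_2,\dots,Y_n)$. We thus conclude that

\[
f_{Y_1,\dots,Y_n}(y_1,\dots,y_n) = f_{Y_1}(y_1)\, f_{Y_2,\dots,Y_n}(y_2,\dots,y_n).
\]

Consequently, $Y_1 = \bar X$ is independent of $(Y_2,\dots,Y_n) = (X_2 - \bar X,\dots,X_n - \bar X)$. Since
$(X_1 - \bar X) = -\sum_{i=2}^n (X_i - \bar X)$ is a function of $(Y_2,\dots,Y_n)$, it follows that $Y_1$ is also
independent of $X_1 - \bar X$.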
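This proves (3), i.e. that $\bar X$ and $S^2$ are independent.

To prove (4) we note that

\[
\begin{aligned}
\sum_{i=1}^n (X_i - \mu)^2 &= \sum_{i=1}^n (X_i - \bar X)^2 + \underbrace{2\sum_{i=1}^n (X_i - \bar X)(\bar X - \mu)}_{=0} + n(\bar X - \mu)^2 \\
&= (n-1)S^2 + n(\bar X - \mu)^2.
\end{aligned}
\]

Dividing both sides by $\sigma^2$, we may write $Q = V + W$, where

\[
Q = \sum_{i=1}^n \left(\frac{X_i-\mu}{\sigma}\right)^2, \qquad V = \frac{(n-1)S^2}{\sigma^2}, \qquad W = \left(\frac{\bar X - \mu}{\sigma/\sqrt{n}}\right)^2.
\]

Since we have proven in part (3) that $S^2$ and $\bar X$ are independent, it follows that the
MGF of $Q$ must be the product of the MGFs of $V$ and of $W$ (again by Lemma A.10,
p. 168):

\[
M_Q(t) = M_V(t)\,M_W(t).
\]

From part (2) we know that $\frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$ and thus $W \sim \chi^2_1$ (as the square of a
standard normal random variable, see Eq. (1.4), p. 21) and so

\[
M_W(t) = (1-2t)^{-1/2}.
\]

We also know that $\frac{X_i - \mu}{\sigma} \overset{iid}{\sim} N(0,1)$, so it is also true that $\left(\frac{X_i-\mu}{\sigma}\right)^2 \overset{iid}{\sim} \chi^2_1$. Therefore,
the MGF of $Q$ is equal to:

\[
M_Q(t) = \prod_{i=1}^n (1-2t)^{-1/2} = (1-2t)^{-n/2}.
\]

Summarising, we have that

\[
\underbrace{(1-2t)^{-n/2}}_{M_Q(t)} = M_V(t)\,\underbrace{(1-2t)^{-1/2}}_{M_W(t)},
\]

from which it follows that

\[
M_V(t) = (1-2t)^{-(n-1)/2}.
\]

This is the MGF of the $\chi^2_{n-1}$ distribution. Since the MGF completely determines a
distribution (Proposition A.9, p. 165), this proves part (4) and completes the proof.
□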
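
Proposition 2.7 lends itself to a quick numerical sanity check. The sketch below is only an illustration under assumed values (the parameters `mu`, `sigma`, `n`, the number of replications `reps` and the seed are all arbitrary choices of ours). It simulates many normal samples and compares the empirical behaviour of $\bar X$ and $(n-1)S^2/\sigma^2$ with what parts (2) to (4) predict, using the standard facts that a $\chi^2_{n-1}$ variable has mean $n-1$ and variance $2(n-1)$.

```python
import random
import statistics

random.seed(0)
mu, sigma, n, reps = 2.0, 3.0, 10, 20000  # assumed values, for illustration only

means, variances = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(sample))
    variances.append(statistics.variance(sample))  # S^2, divides by n - 1

# Part (2): the sample mean should have mean mu and variance sigma^2 / n.
mean_of_means = statistics.fmean(means)
var_of_means = statistics.pvariance(means)

# Part (4): (n - 1) S^2 / sigma^2 should behave like a chi-square with n - 1 df.
scaled = [(n - 1) * s2 / sigma**2 for s2 in variances]
mean_scaled = statistics.fmean(scaled)
var_scaled = statistics.pvariance(scaled)

# Part (3): independence implies zero correlation between the mean and S^2.
mean_v = statistics.fmean(variances)
cov = statistics.fmean(
    [(m - mean_of_means) * (v - mean_v) for m, v in zip(means, variances)]
)
corr = cov / (statistics.pstdev(means) * statistics.pstdev(variances))

print(mean_of_means, var_of_means, mean_scaled, var_scaled, corr)
```

A near-zero correlation is of course weaker than independence; the simulation merely fails to contradict part (3), whereas the proof establishes it.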


The following follows immediately from the proposition:

Corollary 2.8 (Moments for Normal Sampling) Let $X_1,\dots,X_n \overset{iid}{\sim} N(\mu,\sigma^2)$.
Then,

\[
\mathbb{E}[\bar X] = \mu, \qquad \mathrm{Var}(\bar X) = \frac{\sigma^2}{n}, \qquad \mathbb{E}[S^2] = \sigma^2, \qquad \mathrm{Var}(S^2) = \frac{2\sigma^4}{n-1}.
\]

This last result explains why we used the factor $(n-1)^{-1}$ instead of $n^{-1}$ in the
definition of $S^2$. This definition gives us a statistic whose expectation is equal to the
true variance. Finally, we mention here a result that we will find quite useful later.


Theorem 2.9 (Student’s Statistic and Its Sampling Distribution) Let
$X_1,\dots,X_n \overset{iid}{\sim} N(\mu,\sigma^2)$. Then,

\[
\frac{\bar X - \mu}{S/\sqrt{n}} \sim t_{n-1}.
\]

Here $t_{n-1}$ denotes Student’s distribution with $n-1$ degrees of freedom.


Definition 2.10 (Student's t Distribution)
A random variable $X$ is said to follow Student’s t distribution with parameter
$k \in \mathbb{N}$ (called the number of degrees of freedom), denoted $X \sim t_k$, if

\[
f_X(x;k) = \frac{\Gamma\left(\frac{k+1}{2}\right)}{\Gamma\left(\frac{k}{2}\right)\sqrt{k\pi}} \left(1 + \frac{x^2}{k}\right)^{-\frac{k+1}{2}}.
\]

Assuming $k > 2$, the mean and variance of $X \sim t_k$ are given by

\[
\mathbb{E}[X] = 0, \qquad \mathrm{Var}[X] = \frac{k}{k-2}.
\]

The mean is undefined for $k = 1$ and the variance is undefined for $k \le 2$. The
moment generating function is undefined for any $k \in \mathbb{N}$.


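Proof of Theorem 2.9 Let $Z = (\bar X - \mu)/(\sigma/\sqrt{n})$ and $V = (n-1)S^2/\sigma^2$, and note
that

\[
T := \frac{Z}{\sqrt{V/(n-1)}} = \frac{\bar X - \mu}{S/\sqrt{n}}.
\]

Thus, to prove the theorem, we will find the density of $T$. To this aim, we observe
that by Proposition 2.7 (p. 45) we have

1. $Z$ is a standard normal random variable.

2. $V$ is a $\chi^2_{n-1}$ random variable.

3. $Z$ and $V$ are independent.

We will first find the joint density of $(T,V)$, and then integrate to find the marginal
of $T$. To this aim, consider the transformation

\[
g : (z,v) \mapsto (t,v) = \left(\frac{z}{\sqrt{v/(n-1)}},\ v\right),
\]

whose inverse is given by

\[
g^{-1} : (t,v) \mapsto (z,v) = \left(t\sqrt{\frac{v}{n-1}},\ v\right)
\]

and has a corresponding upper triangular Jacobian

\[
J_{g^{-1}}(t,v) = \begin{pmatrix} \sqrt{\frac{v}{n-1}} & \frac{t}{2\sqrt{v(n-1)}} \\ 0 & 1 \end{pmatrix}
\quad\Longrightarrow\quad \det(J_{g^{-1}}(t,v)) = \sqrt{\frac{v}{n-1}}.
\]

Since $Z$ and $V$ are independent, it follows that

\[
f_{Z,V}(z,v) = f_Z(z)\,f_V(v) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\;\frac{1}{2^{\frac{n-1}{2}}\Gamma\left(\frac{n-1}{2}\right)}\,v^{\frac{n-1}{2}-1}e^{-v/2}, \qquad v > 0.
\]

The joint density of $(T,V)$ is thus given by

\[
\begin{aligned}
f_{T,V}(t,v) &= f_{Z,V}(g^{-1}(t,v))\,|\det(J_{g^{-1}}(t,v))| \\
&= \frac{1}{\sqrt{2\pi}}\,e^{-\frac{t^2 v}{2(n-1)}}\;\frac{1}{2^{\frac{n-1}{2}}\Gamma\left(\frac{n-1}{2}\right)}\,v^{\frac{n-1}{2}-1}e^{-v/2}\left(\frac{v}{n-1}\right)^{1/2} \\
&= \frac{1}{2^{n/2}\sqrt{\pi(n-1)}\,\Gamma\left(\frac{n-1}{2}\right)}\;v^{\frac{n}{2}-1}\,e^{-\frac{v}{2}\left(\frac{t^2}{n-1}+1\right)}.
\end{aligned}
\]

It now remains to integrate out $v$, and find the marginal density of $T$:

\[
f_T(t) = \frac{1}{2^{n/2}\sqrt{\pi(n-1)}\,\Gamma\left(\frac{n-1}{2}\right)} \int_0^\infty v^{\frac{n}{2}-1}\,e^{-\frac{v}{2}\left(\frac{t^2}{n-1}+1\right)}\,dv.
\]

Putting

\[
y = \frac{v}{2}\left(\frac{t^2}{n-1}+1\right),
\]

we obtain

\[
v = \frac{2y}{\frac{t^2}{n-1}+1} \quad\text{and}\quad dv = \frac{2\,dy}{\frac{t^2}{n-1}+1},
\]

and thus

\[
\begin{aligned}
f_T(t) &= \frac{1}{2^{n/2}\sqrt{\pi(n-1)}\,\Gamma\left(\frac{n-1}{2}\right)} \int_0^\infty \left[\frac{2y}{\frac{t^2}{n-1}+1}\right]^{\frac{n}{2}-1} e^{-y}\,\frac{2}{\frac{t^2}{n-1}+1}\,dy \\
&= \frac{1}{2^{n/2}\sqrt{\pi(n-1)}\,\Gamma\left(\frac{n-1}{2}\right)}\;2^{n/2}\left(\frac{t^2}{n-1}+1\right)^{-n/2} \int_0^\infty y^{\frac{n}{2}-1}e^{-y}\,dy \\
&= \frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{\pi(n-1)}\,\Gamma\left(\frac{n-1}{2}\right)} \left(\frac{t^2}{n-1}+1\right)^{-n/2} \int_0^\infty \frac{1}{\Gamma\left(\frac{n}{2}\right)}\,y^{\frac{n}{2}-1}e^{-y}\,dy \\
&= \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)\sqrt{\pi(n-1)}} \left(\frac{t^2}{n-1}+1\right)^{-n/2},
\end{aligned}
\]

where the integral in the penultimate line is equal to 1, being the integral of a
$\Gamma(n/2, 1)$ density function. □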
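
As a numerical companion to the proof, the density of Definition 2.10 can be integrated directly. The following sketch is an illustration only: the helper names `student_t_pdf` and `integrate`, the choice $k = 5$, the truncation of the real line to $[-200, 200]$ and the midpoint rule are all assumptions of ours. It checks that the density has total mass 1 and variance $k/(k-2)$.

```python
from math import gamma, pi, sqrt


def student_t_pdf(x, k):
    """Density of Student's t distribution with k degrees of freedom."""
    c = gamma((k + 1) / 2) / (gamma(k / 2) * sqrt(k * pi))
    return c * (1 + x * x / k) ** (-(k + 1) / 2)


def integrate(f, a, b, steps=200_000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


k = 5  # hypothetical number of degrees of freedom (k > 2, so the variance exists)
total_mass = integrate(lambda x: student_t_pdf(x, k), -200, 200)
variance = integrate(lambda x: x * x * student_t_pdf(x, k), -200, 200)
print(total_mass, variance, k / (k - 2))
```

The truncation is harmless here because the tails of the $t_5$ density decay like $|x|^{-6}$; for $k$ close to 2 the variance integral converges much more slowly and a wider range would be needed.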


2.3 Sampling from an Exponential Family


In the previous section we were able to determine the joint distribution of a
normal random sample $X_1,\dots,X_n$, the sampling distribution of two key statistics,
and the moments of these two key statistics. What if the distribution we are sampling
from is not normal, but binomial, or Poisson, or exponential? More generally: what
if the sample $X_1,\dots,X_n$ comes from some other exponential family? In other
words, let $X_1,\dots,X_n \overset{iid}{\sim} f$, where

\[
f(x) = \exp\left\{\sum_{i=1}^k \phi_i T_i(x) - \gamma(\phi_1,\dots,\phi_k) + S(x)\right\}, \qquad x \in \mathcal{X}.
\]

1. Is it possible to find the joint distribution of a sample $(X_1,\dots,X_n)$?

2. Is it possible to find the exact moments of some key statistics?

3. Is it possible to find the exact sampling distribution of some important statistics?

The next theorem gives an affirmative answer to the first two questions. Unfortunately,
the answer to the last question is: it’s complicated. For simplicity, we will
focus on 1-parameter exponential families, but the results can easily be suitably
generalised to the $k$-parameter case.


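Proposition 2.11 (Sampling from an Exponential Family) Let $X_1,\dots,X_n \overset{iid}{\sim} f$,
where

\[
f(x) = \exp\{\phi T(x) - \gamma(\phi) + S(x)\}, \qquad x \in \mathcal{X},
\]

where $\phi \in \Phi \subseteq \mathbb{R}$, be a density of a 1-parameter exponential family form.

Then:

1. The joint density of $(X_1,\dots,X_n)$ is of a 1-parameter exponential family form,
given by

\[
f_{X_1,\dots,X_n}(x_1,\dots,x_n) = \exp\left\{\phi\,\tau(x_1,\dots,x_n) - n\gamma(\phi) + \sum_{i=1}^n S(x_i)\right\}, \qquad x_i \in \mathcal{X},
\]

where

\[
\tau(x_1,\dots,x_n) = \sum_{i=1}^n T(x_i).
\]

2. If $\Phi$ is open, then $\gamma$ is infinitely differentiable, and

\[
\mathbb{E}[\tau(X_1,\dots,X_n)] = n\gamma'(\phi) < \infty \quad\text{and}\quad \mathrm{Var}[\tau(X_1,\dots,X_n)] = n\gamma''(\phi) < \infty.
\]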


Remark 2.12 The proposition demonstrates why $\tau$ is a key statistic that we are
interested in: by the Fisher-Neyman factorisation theorem we can immediately see
that $\tau$ is sufficient for $\phi$ (if $\phi = \eta(\theta)$ for some 1-1 mapping $\eta(\cdot)$, then it is clear that
$\tau$ is also sufficient for $\theta$).


Remark 2.13 The sampling distribution of the sufficient statistic $\tau$ is still of a
1-parameter exponential family form, with the same natural parameter $\phi$ and with
the identity as a natural statistic, i.e. it is of the form

\[
f_\tau(t) = \exp\{\phi t - A(\phi) + B(t)\},
\]

for some $A : \Phi \to \mathbb{R}$ and $B : \mathbb{R} \to \mathbb{R}$ (we will not prove this because it
requires measure theory). However, an explicit general form of the density cannot
be given (i.e. we cannot find a general formula for the form of the functions $A$
and $B$). For a simple general formula, we will need to resort to approximations of
this sampling distribution, and this we do in the next section. Nevertheless, we can
indeed determine general formulae for the mean and variance of $\tau(X_1,\dots,X_n)$.


Remark 2.14 The fact that $\gamma$ is infinitely differentiable when $\Phi$ is open
(conclusion 2 of the Proposition) will be taken for granted for the remainder of
the text.
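
To see the moment formulae of Proposition 2.11 at work, consider the Poisson family written in natural form: $T(x) = x$, $\phi = \log\lambda$, $\gamma(\phi) = e^{\phi}$ and $S(x) = -\log x!$ (a standard parametrisation, but one we supply here; the text does not spell it out). The sketch below, with an arbitrarily chosen $\lambda$ and sample size, compares the mean and variance of $T(X)$ obtained by direct summation of the frequency function with finite-difference approximations of $\gamma'(\phi)$ and $\gamma''(\phi)$.

```python
from math import exp, lgamma, log

lam = 3.5           # assumed Poisson mean, for illustration only
phi = log(lam)      # natural parameter
gamma_fn = exp      # gamma(phi) = e^phi for the Poisson family


def density(x, phi):
    """exp{phi * T(x) - gamma(phi) + S(x)} with T(x) = x and S(x) = -log(x!)."""
    return exp(phi * x - gamma_fn(phi) - lgamma(x + 1))


# Moments of T(X) = X by direct summation (the mass beyond 200 is negligible).
support = range(200)
mean_T = sum(x * density(x, phi) for x in support)
var_T = sum((x - mean_T) ** 2 * density(x, phi) for x in support)

# Central finite differences approximating gamma'(phi) and gamma''(phi).
h = 1e-4
d1 = (gamma_fn(phi + h) - gamma_fn(phi - h)) / (2 * h)
d2 = (gamma_fn(phi + h) - 2 * gamma_fn(phi) + gamma_fn(phi - h)) / h**2

n = 25  # hypothetical sample size
mean_tau, var_tau = n * d1, n * d2  # conclusion 2 of Proposition 2.11
print(mean_T, d1, var_T, d2, mean_tau, var_tau)
```

Both moments of $T(X)$ agree with the derivatives of $\gamma$, and both equal $\lambda$, in line with the familiar fact that a Poisson variable has equal mean and variance.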


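Proof of Proposition 2.11 Part (1) is immediate from independence and from the
form of a 1-parameter exponential family. To prove (2), we first calculate the MGF
of $T(X_i)$, for some $i \le n$:

\[
\begin{aligned}
M_T(u) &= \int_{\mathcal{X}} \exp\{uT(x)\}\exp\{\phi T(x) - \gamma(\phi) + S(x)\}\,dx \\
&= \exp\{\gamma(u+\phi) - \gamma(\phi)\} \int_{\mathcal{X}} \exp\{(u+\phi)T(x) - \gamma(u+\phi) + S(x)\}\,dx.
\end{aligned}
\]

Since $\Phi$ is open, there exists an $\epsilon > 0$ such that $(u+\phi) \in \Phi$ if $|u| < \epsilon$. Thus $u+\phi$ is a
valid parameter when $|u| < \epsilon$, yielding $\int_{\mathcal{X}} \exp\{(u+\phi)T(x) - \gamma(u+\phi) + S(x)\}\,dx =
1$. We conclude:

\[
M_T(u) = \exp\{\gamma(u+\phi) - \gamma(\phi)\}, \qquad |u| < \epsilon. \tag{2.1}
\]

Since the moment generating function exists for $|u| < \epsilon$, it follows from Proposition
A.8 (p. 163) that $M_T$ is infinitely differentiable for $|u| < \epsilon$ and so it also must
be that $\gamma$ is infinitely differentiable on $\Phi$. Furthermore, Proposition A.8 (p. 163) also
implies that all moments of $T(X_i)$ exist, for all values of $\phi \in \Phi$; and

\[
\begin{aligned}
\mathbb{E}[T(X_i)] &= \frac{d}{du}M_T(u)\Big|_{u=0} = \gamma'(\phi), \\
\mathbb{E}[T^2(X_i)] &= \frac{d^2}{du^2}M_T(u)\Big|_{u=0} = \gamma''(\phi) + (\gamma'(\phi))^2.
\end{aligned}
\]

We conclude that $\mathbb{E}[T(X_i)] = \gamma'(\phi)$ and that $\mathrm{Var}[T(X_i)] = \mathbb{E}[T^2(X_i)] -
(\mathbb{E}[T(X_i)])^2 = \gamma''(\phi)$. It now immediately follows by independence of $X_1,\dots,X_n$
that

\[
\begin{aligned}
\mathbb{E}[\tau(X_1,\dots,X_n)] &= \mathbb{E}\left[\sum_{i=1}^n T(X_i)\right] = \sum_{i=1}^n \mathbb{E}[T(X_i)] = n\gamma'(\phi), \\
\mathrm{Var}[\tau(X_1,\dots,X_n)] &= \mathrm{Var}\left[\sum_{i=1}^n T(X_i)\right] = \sum_{i=1}^n \mathrm{Var}[T(X_i)] = n\gamma''(\phi).
\end{aligned}
\]


Exercise 23 Let $X_1,\dots,X_n \overset{iid}{\sim} f$, where $f$ is of an exponential family form,
expressed in the usual parametrisation as $f(x) = \exp[\eta(\theta)T(x) - d(\theta) + S(x)]$.

Assuming that $\Theta$ is open, show that:

1. If $\eta$ is $k$-times continuously differentiable ($k \ge 1$), invertible, and $\eta'(\theta) \neq 0$,
then $d$ is $k$-times continuously differentiable.

2. If $\eta$ is twice continuously differentiable and invertible with $\eta'(\theta) \neq 0$, then

\[
\mathbb{E}[\tau(X_1,\dots,X_n)] = n\,\frac{d'(\theta)}{\eta'(\theta)} \quad\&\quad \mathrm{Var}[\tau(X_1,\dots,X_n)] = n\,\frac{d''(\theta)\eta'(\theta) - d'(\theta)\eta''(\theta)}{[\eta'(\theta)]^3}.
\]

Hint: use the inverse function theorem (Theorem A.2, p. 159).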


Remark 2.15 The fact that $d$ is $k$-times continuously differentiable (for $k \ge 1$)
whenever $\Theta$ is open and $\eta$ is $k$-times continuously differentiable, invertible, and has
a non-vanishing derivative (see part 1 of the exercise) will be taken for granted in
the rest of the text without special mention.


2.4 Approximate Sampling Distributions 


We saw in the last section that the sampling distribution of the sufficient statistic
$\tau(X_1,\dots,X_n)$ when sampling from a one-parameter exponential family may not
be straightforward to determine exactly. For this reason, we will often try to
approximate it, assuming that the sample size $n$ is large enough. This requires
a mathematical notion of what it means to say that the distribution $F_{\tau(X_1,\dots,X_n)}$
is approximately given by some other distribution $G$. If we see $F_{\tau(X_1,\dots,X_n)}$ as
a sequence of distribution functions $F_n$ indexed by the sample size $n$, then
approximation by $G$ should be formalised by some notion of convergence of $F_n$
to $G$ as $n \to \infty$. The appropriate type of convergence is called convergence in
distribution.


Definition 2.16 (Convergence in Distribution)
Let $\{F_n\}_{n\ge1}$ be a sequence of distribution functions and $G$ be a distribution
function on $\mathbb{R}$. We say that $F_n$ converges in distribution to $G$, and write $F_n \xrightarrow{d}
G$, if and only if

\[
F_n(x) \xrightarrow{n\to\infty} G(x),
\]

for all $x$ that are continuity points of $G$.

Remark 2.17 Notice that convergence in distribution is similar to pointwise
convergence of the sequence of distributions, except that we do not insist on
pointwise convergence at the discontinuity points of the limit (recall that any
distribution is cadlag: continuous from the right, and has limits from the left).


Example 2.18 (Maximum of Uniform Random Variables)
Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Unif}(0,1)$, $M_n = \max\{X_1,\dots,X_n\}$, and $Q_n = n(1 - M_n)$.
Then, for every fixed $x \ge 0$ and all $n \ge x$,

\[
\mathbb{P}[Q_n \le x] = \mathbb{P}[M_n \ge 1 - x/n] = 1 - \left(1 - \frac{x}{n}\right)^n \xrightarrow{n\to\infty} 1 - e^{-x}.
\]

Note that the limit is the distribution function of an $\mathrm{Exp}(1)$ random variable. □
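
The convergence in Example 2.18 can be made concrete by tabulating the gap between the exact distribution function of $Q_n$ and its exponential limit. The sketch below is purely illustrative (the function names, the grid on $[0,5]$ and the sample sizes are arbitrary choices of ours).

```python
from math import exp


def cdf_Qn(x, n):
    """Exact P[Q_n <= x] = 1 - (1 - x/n)^n, valid for 0 <= x <= n."""
    return 1 - (1 - x / n) ** n


def max_gap(n, grid):
    """Largest gap between the exact CDF of Q_n and the Exp(1) CDF on a grid."""
    return max(abs(cdf_Qn(x, n) - (1 - exp(-x))) for x in grid)


grid = [0.1 * i for i in range(51)]  # points x in [0, 5]
gaps = {n: max_gap(n, grid) for n in (10, 100, 1000, 10000)}
for n, gap in gaps.items():
    print(n, gap)
```

The gap shrinks roughly in proportion to $1/n$, which is what a second-order expansion of $(1 - x/n)^n$ around $e^{-x}$ suggests.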


Exercise 24 (Law of Rare Events) Let $\{X_n\}_{n\ge1}$ be a sequence of $\mathrm{Binom}(n, p_n)$
random variables, such that $p_n = \lambda/n$, for some constant $\lambda > 0$. Prove that $X_n \xrightarrow{d}
Y$, where $Y \sim \mathrm{Poisson}(\lambda)$.


When $F_n(x) = \mathbb{P}[X_n \le x]$ for some sequence of random variables $\{X_n\}_{n\ge1}$ and
$G(x) = \mathbb{P}[Z \le x]$ for some other random variable $Z$, we will abuse notation and
write

\[
X_n \xrightarrow{d} Z.
\]

This will be taken to mean that the distribution of $X_n$ can be approximated, for large
$n$, by the distribution of $Z$. So, if we denote $\tau_n = \tau(X_1,\dots,X_n)$, then the problem
of determining an approximate distribution for $\tau(X_1,\dots,X_n)$ is equivalent to the
problem of finding some random variable $Z$ whose distribution is explicitly known,
and such that $\tau_n \xrightarrow{d} Z$. We will give a partial solution to this problem in the next
two subsections.

Before we conclude this introduction, we introduce a second type of convergence
that merits independent consideration.


Definition 2.19 (Convergence in Probability)
If a sequence of random variables $\{X_n\}$ is such that $\mathbb{P}[|X_n - Y| > \epsilon] \to 0$ for
all $\epsilon > 0$ and for some other random variable $Y$, we say that $X_n$ converges in
probability to $Y$ and write $X_n \xrightarrow{p} Y$.


In general $X_n \xrightarrow{p} Y \implies X_n \xrightarrow{d} Y$, but the converse may fail to hold true:


Exercise 25 Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables with

\[
X_n = (-1)^n X, \qquad \mathbb{P}[X = -1] = \mathbb{P}[X = 1] = \frac{1}{2}.
\]

Show that $X_n \xrightarrow{d} X$, but that $X_n$ does not converge in probability to $X$.


Suppose, though, that $Y = c \in \mathbb{R}$ is a constant, and $\{X_n\}_{n\ge1}$ is a sequence such that
$X_n \xrightarrow{d} c$. Then we have the following result.


Lemma 2.20 Let $\{X_n\}_{n\ge1}$ be a sequence of random variables taking values in
$\mathbb{R}$, and $c \in \mathbb{R}$ be some constant. Then,

\[
X_n \xrightarrow{d} c \iff \mathbb{P}[|X_n - c| > \epsilon] \xrightarrow{n\to\infty} 0, \quad \forall \epsilon > 0.
\]


Exercise 26 Prove the last lemma.


2.4.1 Approximate Distributions for Sums 


It was seen in Proposition 2.11 (p. 51) that the sufficient statistic for an iid sample
$X_1,\dots,X_n$ from a one-parameter exponential family

\[
f(x) = \exp\{\phi T(x) - \gamma(\phi) + S(x)\}
\]

is of the form $\tau(X_1,\dots,X_n) = \sum_{i=1}^n T(X_i)$, where

\[
\mathbb{E}[\tau(X_1,\dots,X_n)] = n\gamma'(\phi) < \infty \quad\text{and}\quad \mathrm{Var}[\tau(X_1,\dots,X_n)] = n\gamma''(\phi) < \infty.
\]

If we define

\[
\bar T_n = \frac{1}{n}\tau(X_1,\dots,X_n) = \frac{1}{n}\sum_{i=1}^n T(X_i),
\]

then we notice that we have a random variable that is built as the average of $n$
iid random variables, and that has finite mean $\gamma'(\phi)$ and finite variance $\gamma''(\phi)/n$. Though
the exact sampling behaviour of such averages might not always be tractable, their
behaviour for large $n$ becomes surprisingly simple. The goal of this section is to
describe this behaviour. In other words, given $Y_1,\dots,Y_n$ iid random variables with
$\mathbb{E}[Y_i] = \mu < \infty$ and $\mathrm{Var}[Y_i] = \sigma^2 < \infty$, we wish to study the approximate
distribution of $\sum_{i=1}^n Y_i$.


We note that the expectation of $\sum_{i=1}^n Y_i$ is $n\mu$, which diverges as $n$ grows (unless $\mu = 0$).
Therefore, we cannot hope to get a distributional approximation if we do not tame
this explosion. The first idea that comes to mind is to simply divide by $n$. That is,
to look at the empirical mean $\bar Y_n = \frac{1}{n}\sum_{i=1}^n Y_i$ instead. The expectation of this
empirical mean is $\mu$, which remains constant with respect to $n$. By Chebyshev’s
inequality (Lemma A.4, p. 159), we have that

\[
\mathbb{P}[|\bar Y_n - \mu| > \epsilon] \le \frac{\sigma^2}{n\epsilon^2} \xrightarrow{n\to\infty} 0, \quad \forall \epsilon > 0.
\]


Theorem 2.21 ($L^2$ Weak Law of Large Numbers) Let $Y_1,\dots,Y_n$ be iid
random variables such that $\mathbb{E}[Y_i] = \mu < \infty$ and $\mathrm{Var}[Y_i] = \sigma^2 < \infty$. Let
$\bar Y_n = \frac{1}{n}\sum_{i=1}^n Y_i$. Then,

\[
\bar Y_n \xrightarrow{p} \mu.
\]


Remark 2.22 ($L^1$ Weak Law of Large Numbers) Actually, the same conclusion
can be drawn under weaker assumptions: it suffices to assume that $\mathbb{E}|Y_i| < \infty$,
rather than $\mathrm{Var}[Y_i] < \infty$.
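
The Chebyshev argument behind the weak law can be illustrated by simulation. The following sketch makes several assumptions for the sake of a concrete example: the $Y_i$ are taken to be $\mathrm{Unif}(0,1)$ (so $\mu = 1/2$ and $\sigma^2 = 1/12$), and the tolerance `eps`, the number of replications and the seed are arbitrary. For each $n$ it estimates $\mathbb{P}[|\bar Y_n - \mu| > \epsilon]$ and reports it next to the bound $\sigma^2/(n\epsilon^2)$, capped at 1.

```python
import random

random.seed(1)
# Unif(0, 1) has mean 1/2 and variance 1/12; eps and reps are illustrative choices.
mu, var, eps, reps = 0.5, 1 / 12, 0.05, 2000


def exceed_frequency(n):
    """Fraction of replications in which |mean of n uniforms - mu| > eps."""
    hits = 0
    for _ in range(reps):
        ybar = sum(random.random() for _ in range(n)) / n
        hits += abs(ybar - mu) > eps
    return hits / reps


results = {
    n: (exceed_frequency(n), min(1.0, var / (n * eps**2)))
    for n in (10, 100, 1000)
}
for n, (freq, bound) in results.items():
    print(n, freq, bound)
```

The observed frequencies fall well below the Chebyshev bound, which is valid for every distribution with finite variance and is therefore typically far from tight.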


Consequently, the realisations of the random variable $\bar Y_n$ become more and more
concentrated around its mean as $n$ grows, i.e. $(\bar Y_n - \mu) \xrightarrow{p} 0$. How does $\bar Y_n$ vary
around $\mu$, though, as $n$ grows? The factor $n^{-1}$ was such that it made $n^{-1}\sum_{i=1}^n Y_i$
converge to a constant. The reason is that multiplying with the factor $n^{-1}$ made the
variance equal to $\sigma^2/n$, and hence made it converge to zero. The key observation is
that the mean of $c \times \sum_{i=1}^n Y_i$ scales linearly in $c$ but its variance scales quadratically
in $c$. To get a finer approximation we need to consider the re-scaled differences
$\sqrt{n}(\bar Y_n - \mu)$. Notice that these have variance $\sigma^2$ for all $n$. The following remarkable
result tells us that these scaled differences are approximately normal:


Theorem 2.23 (Central Limit Theorem) Let $Y_1,\dots,Y_n$ be iid random variables
such that $\mathbb{E}[Y_i] = \mu < \infty$ and $\mathrm{Var}[Y_i] = \sigma^2 < \infty$. Let $\bar Y_n = \frac{1}{n}\sum_{i=1}^n Y_i$.
Then,

\[
\sqrt{n}(\bar Y_n - \mu) \xrightarrow{d} N(0,\sigma^2).
\]

We discuss the proof of the Central Limit Theorem in Sect. A.8 (p. 173). We 
now have an immediate corollary, by combining the Central Limit Theorem with 
Proposition 2.11 (p. 51), that will be very useful for statistical inference: 


Corollary 2.24 (Approximate Sampling Distribution in Exponential Families)
Let $X_1,\dots,X_n \overset{iid}{\sim} f$, where

\[
f(x) = \exp\{\phi T(x) - \gamma(\phi) + S(x)\}, \qquad x \in \mathcal{X},
\]

where $\phi \in \Phi \subseteq \mathbb{R}$. Let

\[
\bar T_n = \frac{1}{n}\sum_{i=1}^n T(X_i) = n^{-1}\tau(X_1,\dots,X_n).
\]

If $\Phi$ is open, then

\[
\sqrt{n}(\bar T_n - \gamma'(\phi)) \xrightarrow{d} N(0, \gamma''(\phi)).
\]
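
A small simulation can make Corollary 2.24 tangible. The sketch below assumes the Bernoulli family in natural form, with $T(x) = x$, $\phi = \log\{p/(1-p)\}$ and $\gamma(\phi) = \log(1 + e^{\phi})$ (a standard parametrisation that we supply for illustration; the values of `p`, `n`, `reps` and the seed are arbitrary). It draws many values of $\sqrt{n}(\bar T_n - \gamma'(\phi))$ and compares their empirical mean, variance and central coverage with those of the $N(0,\gamma''(\phi))$ limit.

```python
import random
import statistics
from math import exp, log, sqrt

random.seed(2)
p, n, reps = 0.3, 400, 5000              # assumed values, for illustration only
phi = log(p / (1 - p))                   # natural parameter of Bernoulli(p)
gamma1 = exp(phi) / (1 + exp(phi))       # gamma'(phi)  = p
gamma2 = exp(phi) / (1 + exp(phi)) ** 2  # gamma''(phi) = p (1 - p)


def scaled_error():
    """One draw of sqrt(n) * (T_bar_n - gamma'(phi)) with T(x) = x."""
    t_bar = sum(random.random() < p for _ in range(n)) / n
    return sqrt(n) * (t_bar - gamma1)


draws = [scaled_error() for _ in range(reps)]
emp_mean = statistics.fmean(draws)
emp_var = statistics.pvariance(draws)
# Under the normal limit, roughly 95% of draws lie within 1.96 standard deviations.
coverage = sum(abs(d) <= 1.96 * sqrt(gamma2) for d in draws) / reps
print(emp_mean, emp_var, gamma2, coverage)
```

Because $\bar T_n$ is discrete, the coverage sits slightly off the nominal 95% at this sample size; the approximation is a statement about the limit $n \to \infty$.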


2.4.2 Approximate Distributions for Functions of Sums


What if the statistic whose sampling distribution we wish to determine is not just
a sum of iid random variables, but some smooth function of a sum? For example,
suppose that we wish to consider a statistic of the form $g(\bar Y_n)$ rather than $\bar Y_n$ itself.
Can we say anything about the asymptotic behaviour of this new random variable?
The next three results give us affirmative answers to this question in some important
special cases.


Theorem 2.25 (Continuous Mapping Theorem) If $X$ is a random variable
such that $\mathbb{P}[X \in A] = 1$ and $g : \mathbb{R} \to \mathbb{R}$ is continuous everywhere on $A$,
then

\[
X_n \xrightarrow{d} X \implies g(X_n) \xrightarrow{d} g(X).
\]

Proof See Sect. A.7 (p. 169). □


Theorem 2.26 (Slutsky’s Theorem) Let $X$ be a random variable and $c \in \mathbb{R}$ a
constant. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} c \in \mathbb{R}$, then it follows that $X_n + Y_n \xrightarrow{d} X + c$
and $X_n Y_n \xrightarrow{d} cX$, as $n \to \infty$.


Proof See Sect. A.7 (p. 169). □


It’s important to note that, in general, one cannot replace the constant $c \in \mathbb{R}$
with a non-degenerate random variable, say $Y$, in Slutsky’s theorem. The problem
is that we have no information on the joint distribution of $(X_n, Y_n)$. For a simple
counterexample, take $X_n = -Z + n^{-1}$ and $Y_n = Z - n^{-1} = -X_n$ for $Z \sim N(0,1)$.
Then, $X_n \xrightarrow{d} Z$ (since $-Z \sim N(0,1)$), $Y_n \xrightarrow{d} Z$, but for all $n$ we have $X_n + Y_n = 0$,
and thus $X_n + Y_n$ fails to converge in distribution to $2Z$.


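Theorem 2.27 (The Delta Method) Let $Z_n := a_n(X_n - \theta) \xrightarrow{d} Z$ where $a_n, \theta \in
\mathbb{R}$ for all $n$ and $a_n \to \infty$. Let $g : \mathbb{R} \to \mathbb{R}$ be differentiable at $\theta$. Then, $a_n(g(X_n) -
g(\theta)) \xrightarrow{d} g'(\theta)Z$, provided that $g'(\theta) \neq 0$.


Proof Taylor expanding (Theorem A.1, p. 159) around $\theta$ gives

\[
g(X_n) = g(\theta) + g'(\theta_n^*)(X_n - \theta),
\]

where $\theta_n^*$ lies between $X_n$ and $\theta$. Thus $|\theta_n^* - \theta| \le |X_n - \theta| = a_n^{-1}|a_n(X_n - \theta)| =
a_n^{-1}|Z_n| \xrightarrow{p} 0$, as a result of Slutsky’s theorem. Therefore, $\theta_n^* \xrightarrow{p} \theta$. By the continuous
mapping theorem it now follows that $g'(\theta_n^*) \xrightarrow{p} g'(\theta)$. Consequently,

\[
\begin{aligned}
a_n(g(X_n) - g(\theta)) &= a_n\big(g(\theta) + g'(\theta_n^*)(X_n - \theta) - g(\theta)\big) \\
&= g'(\theta_n^*)\,a_n(X_n - \theta) \xrightarrow{d} g'(\theta)Z,
\end{aligned}
\]

using Slutsky’s theorem once again. □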


These three results enable us to obtain new limit theorems (new approximations)
from old ones. For example, the central limit theorem tells us that if $Y_1,\dots,Y_n$ are
iid with mean $\mu$ and finite variance $\sigma^2 < \infty$, then $\sqrt{n}(\bar Y_n - \mu) \xrightarrow{d} N(0,\sigma^2)$. Now,
the delta method implies that

\[
\sqrt{n}(g(\bar Y_n) - g(\mu)) \xrightarrow{d} N(0, \sigma^2(g'(\mu))^2),
\]

for all continuously differentiable functions $g$. Now let $W_n$ be a sequence of random
variables such that $W_n \xrightarrow{p} \sigma$. It is an easy exercise to use Slutsky’s theorem and
conclude that

\[
\sqrt{n}\left(\frac{g(\bar Y_n) - g(\mu)}{W_n}\right) \xrightarrow{d} N(0, (g'(\mu))^2).
\]
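
The delta method approximation above can be probed numerically. In the sketch below everything concrete is an assumption made for illustration: the $Y_i$ are $\mathrm{Unif}(0,1)$ (so $\mu = 1/2$, $\sigma^2 = 1/12$), the smooth function is the hypothetical choice $g(y) = y^2$ with $g'(\mu) = 1$, and `n`, `reps` and the seed are arbitrary. The empirical variance of $\sqrt{n}(g(\bar Y_n) - g(\mu))$ is compared with the limiting variance $\sigma^2 (g'(\mu))^2$.

```python
import random
import statistics
from math import sqrt

random.seed(3)
mu, sigma2 = 0.5, 1 / 12  # mean and variance of Unif(0, 1)
n, reps = 500, 4000       # illustrative sample size and number of replications


def g(y):
    """Hypothetical smooth function used for illustration: g(y) = y^2."""
    return y * y


g_prime_mu = 2 * mu  # g'(mu) = 1

draws = []
for _ in range(reps):
    ybar = sum(random.random() for _ in range(n)) / n
    draws.append(sqrt(n) * (g(ybar) - g(mu)))

emp_mean = statistics.fmean(draws)
emp_var = statistics.pvariance(draws)
limit_var = sigma2 * g_prime_mu**2  # variance of the N(0, sigma^2 g'(mu)^2) limit
print(emp_mean, emp_var, limit_var)
```

The small positive empirical mean reflects the second-order term of the Taylor expansion, which is of order $n^{-1/2}$ after scaling and vanishes in the limit.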
Exercise 27 Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Pois}(\lambda)$, where $\lambda \in (0,\infty)\setminus\{1\}$ and consider the
probability $\pi = \mathbb{P}(X_1 = 1) = \lambda e^{-\lambda}$. We wish to approximate $\pi$ by $\hat\pi_n = \hat\lambda_n e^{-\hat\lambda_n}$
where $\hat\lambda_n = \frac{1}{n}\sum_{i=1}^n X_i$ (effectively replacing the true mean in the expression for
the probability by the empirical mean). We know that $\hat\lambda_n$ satisfies the central limit
theorem. Show that this also gives a central limit theorem for $\hat\pi_n$ in the form of

\[
\frac{\sqrt{n}(\hat\pi_n - \pi)}{\sqrt{\hat\lambda_n}\,e^{-\hat\lambda_n}(1 - \hat\lambda_n)} \xrightarrow{d} Y,
\]

where $Y \sim N(0,1)$. Hint: you will need to use the central limit theorem, the delta
method, the law of large numbers, and Slutsky’s theorem.


Exercise 28 Let $x_1,\dots,x_n$ be independent realisations of a random variable $X$
possessing a continuous density function $f$. Show that the histogram $\mathrm{hist}_{x_1,\dots,x_n}(y)$
converges in probability pointwise to $f(y)$, as $n \to \infty$, $h_n \to 0$ and $nh_n \to \infty$.
Hint: the number of observations in the interval $I_j$, given by $N_n = \sum_{i=1}^n \mathbf{1}\{x_i \in I_j\}$,
follows a $\mathrm{Binom}(n, p_n)$ distribution, where $p_n = \int_{I_j} f(x)\,dx$. You will need to use
the fact that

\[
\left|\frac{N_n}{nh_n} - f(y)\right| \le \left|\frac{N_n}{nh_n} - \frac{p_n}{h_n}\right| + \left|\frac{p_n}{h_n} - f(y)\right|,
\]

as well as Chebyshev’s inequality (Lemma A.4, p. 159).


Point Estimation of Model Parameters 


We now return to the bigger picture: we are modelling a stochastic phenomenon by
a regular parametric family of distributions $\mathcal{F} = \{F_\theta : \theta \in \Theta\}$, where $\Theta \subseteq \mathbb{R}^p$. We
observe $n$ independent and identically distributed outcomes from the phenomenon,
say $X_1,\dots,X_n \overset{iid}{\sim} F_\theta$ for some $\theta \in \Theta$, but do not know/observe the $\theta \in \Theta$ that
generated them (the true state of nature). With this iid sample at our disposal, we
wish to make inferences about $\theta$. Perhaps the most obvious inference we may wish
to draw is: which is the $\theta$ that generated the sample $X_1,\dots,X_n$? This is known as
the problem of point estimation. Since $X_1,\dots,X_n$ is all we have available to estimate
the value of $\theta$, we will use some function of the sample as an estimator.


Definition 3.1 (Point Estimator)
A statistic whose range is contained in $\Theta$ is called a point estimator. Equivalently,
a point estimator is a statistic $T : \mathcal{X}^n \to \Theta$.


Remark 3.2 Since the purpose of an estimator is to provide a guess of the true
$\theta$ that generated the data, we typically denote it by $\hat\theta$. Note that $\theta$ is a deterministic
parameter, but $\hat\theta$ is a random variable, since $\hat\theta = T(X_1,\dots,X_n)$.


Clearly, the purpose of an estimator is to estimate the unknown parameter. But,
according to the definition, essentially any function of the sample that maps into $\Theta$
could be an estimator. Which one should we pick? Or, even more simply, if we are
presented with an estimator $\hat\theta$ how can we judge its quality?

The important thing here is that estimators are random variables. Therefore, for
every realisation of the sample $X_1,\dots,X_n$ the estimator $\hat\theta$ will take a different value.
A good estimator should be such that its typical realisations fall “close” to $\theta$. In other
words, the distribution of a good estimator is concentrated around the value of the
true parameter $\theta$.


3.1 Criteria for Comparing Estimators


Still, the question remains: how can we measure the concentration of the
distribution of $\hat\theta$? There are many different criteria that one could use, but there are
two basic concentration characteristics that statisticians typically focus on: the mean
and the variance of $\hat\theta$. Why?

1. One reason is that the mean and the variance are easy to interpret: the mean $\mathbb{E}[\hat\theta]$
tells us how close to the target our estimator is on average, and $\mathrm{Var}[\hat\theta]$ tells us
how dispersed our estimator is around its average. If the mean is close to $\theta$ and the
variance is small, we should have reasonable concentration.

2. A second reason is that the exact distribution of $\hat\theta$ is often unknown. As we
saw in previous sections, we then need to resort to asymptotic approximations.
Relatively often, it happens that the approximate distribution of $\hat\theta$ is normal.
And, for the normal distribution, the mean and the variance capture all of the
concentration characteristics.

3. Even if the approximate distribution is not normal, concentration inequalities
such as Markov’s inequality or Chebyshev’s inequality (Lemmas A.3, p. 159
and A.4, p. 159) can be used to bound the probability $\mathbb{P}[\|\hat\theta - \theta\| > \epsilon]$
(which expresses concentration) given knowledge of a mean and a variance. Such
inequalities are valid regardless of the precise distribution of $\hat\theta$.

It turns out that the so-called mean squared error takes both the mean and the
variance into account.


Definition 3.3 (Mean Squared Error)
Let $\hat\theta$ be an estimator for a parameter $\theta$ of a parametric model $\{F_\theta : \theta \in \Theta\}$. The
mean squared error of $\hat\theta$ is defined to be

\[
\mathrm{MSE}(\hat\theta, \theta) = \mathbb{E}[\|\hat\theta - \theta\|^2].
\]


Notice that the MSE depends on both our estimator and the true state of nature.
Therefore, an estimator $\hat\theta$ may perform well if the true $\theta$ is in some region of the
parameter space $\Theta$, but not as well in other regions of the parameter space. We will
revisit this issue later.

For the moment, though, we see why the MSE is connected with the mean and the
variance of $\hat\theta$:


Lemma 3.4 (Bias-Variance Decomposition) Write $\hat\theta = (\hat\theta_1,\dots,\hat\theta_p)^\top$. The
mean squared error of an estimator admits the decomposition

\[
\mathrm{MSE}(\hat\theta, \theta) = \|\mathbb{E}[\hat\theta] - \theta\|^2 + \mathbb{E}[\|\hat\theta - \mathbb{E}[\hat\theta]\|^2] = \|\mathrm{bias}(\hat\theta, \theta)\|^2 + \sum_{k=1}^p \mathrm{Var}[\hat\theta_k].
\]


Remark 3.5 We call the quantity $\mathbb{E}[\hat\theta] - \theta = \mathrm{bias}(\hat\theta, \theta)$ the bias of the estimator
$\hat\theta$ at true parameter $\theta$. It expresses how far off $\hat\theta$ is from $\theta$ on average. When the bias
at some coordinate of $\theta$ is positive we have overestimation; when it is negative we
have underestimation; when the bias is zero, we speak of an unbiased estimator.
Notice that the variances $\mathrm{Var}[\hat\theta_k]$ can also depend on $\theta$, even though this is not
explicitly reflected in the notation.


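Proof of Lemma 3.4 We expand the MSE after adding and subtracting $\mathbb{E}[\hat\theta]$:

\[
\begin{aligned}
\mathbb{E}\big[\|\hat\theta - \theta\|^2\big]
&= \mathbb{E}\big[\|\hat\theta - \mathbb{E}[\hat\theta] + \mathbb{E}[\hat\theta] - \theta\|^2\big] \\
&= \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta] + \mathbb{E}[\hat\theta] - \theta)^\top (\hat\theta - \mathbb{E}[\hat\theta] + \mathbb{E}[\hat\theta] - \theta)\big] \\
&= \|\mathbb{E}[\hat\theta] - \theta\|^2 + \mathbb{E}\big[\|\hat\theta - \mathbb{E}[\hat\theta]\|^2\big] + 2\,\mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^\top (\mathbb{E}[\hat\theta] - \theta)\big] \\
&= \|\mathbb{E}[\hat\theta] - \theta\|^2 + \mathbb{E}\big[\|\hat\theta - \mathbb{E}[\hat\theta]\|^2\big] + 2\,\underbrace{(\mathbb{E}[\hat\theta] - \mathbb{E}[\hat\theta])^\top}_{=0}\,(\mathbb{E}[\hat\theta] - \theta) \\
&= \|\mathbb{E}[\hat\theta] - \theta\|^2 + \sum_{k=1}^{p} \mathbb{E}\big[(\hat\theta_k - \mathbb{E}[\hat\theta_k])^2\big],
\end{aligned}
\]

by linearity of the expectation and since $(\mathbb{E}[\hat\theta] - \theta)$ is deterministic. □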
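
The decomposition can also be checked numerically. The following is a minimal sketch, not part of the original text: it assumes a scalar parameter, an $N(\theta, 1)$ sample, and a deliberately biased estimator (the sample mean multiplied by a hypothetical shrinkage factor `shrink`). The function names are illustrative only. Over the simulated replications, the empirical MSE equals the squared empirical bias plus the empirical variance, mirroring the algebra of the proof.

```python
import random

def simulate_estimates(theta, n, shrink, replications, seed=0):
    """Draw `replications` samples of size n from N(theta, 1) and return
    the shrunken sample mean shrink * mean(X) computed on each sample."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        estimates.append(shrink * sum(sample) / n)
    return estimates

def empirical_decomposition(estimates, theta):
    """Return (mse, squared_bias, variance) over the replications, using the
    1/N normalisation so that mse = squared_bias + variance holds exactly."""
    m = len(estimates)
    mean_est = sum(estimates) / m
    mse = sum((e - theta) ** 2 for e in estimates) / m
    squared_bias = (mean_est - theta) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / m
    return mse, squared_bias, variance

estimates = simulate_estimates(theta=2.0, n=10, shrink=0.8, replications=5000)
mse, squared_bias, variance = empirical_decomposition(estimates, theta=2.0)
# Theory for this setup: bias^2 = (0.8 * 2 - 2)^2 = 0.16, variance = 0.64 / 10.
print(mse, squared_bias + variance)
```

In this setup the theoretical value of the MSE is $0.16 + 0.064 = 0.224$, and the simulated value should be close to it.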


Exercise 29 (Unbiased Estimators Don’t Always Exist) Let $Y \sim \mathrm{Binom}(n, p)$,
where $p \in (0, 1)$.

1. Show that $Y/n$ is an unbiased estimator of $p$.

2. Show that there exists no unbiased estimator of $1/p$.

3. Show that there exists no unbiased estimator of the natural parameter $\phi = \log\left(\frac{p}{1-p}\right)$.

Remark: $\phi$ is called the log odds ratio.


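As was noted earlier, the concentration of an estimator $\hat\theta$ around the true
parameter $\theta$ can always be bounded using the mean squared error (provided that
the estimator $\hat\theta$ has finite variance).


Lemma 3.6 Let $\hat\theta$ be an estimator of $\theta \in \mathbb{R}^p$ such that $\mathrm{Var}[\hat\theta] < \infty$. Then, for
all $\epsilon > 0$,

\[
\mathbb{P}\big[\|\hat\theta - \theta\| > \epsilon\big] \le \frac{\mathrm{MSE}(\hat\theta, \theta)}{\epsilon^2}.
\]

Proof Let $X = \|\hat\theta - \theta\|^2$. Since $\epsilon > 0$, Markov’s inequality (Lemma A.3, p. 159)
yields

\[
\mathbb{P}\big[\|\hat\theta - \theta\| > \epsilon\big] = \mathbb{P}\big[X > \epsilon^2\big] \le \frac{\mathbb{E}[X]}{\epsilon^2} = \frac{\mathrm{MSE}(\hat\theta, \theta)}{\epsilon^2}.
\]

□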
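
To see how conservative the bound of Lemma 3.6 can be, here is a small simulation sketch. It is an illustration under assumed settings (an $N(\theta, 1)$ sample, the sample mean as estimator, and hypothetical function names), not a construction from the text. It compares the observed frequency of the event $\{|\hat\theta - \theta| > \epsilon\}$ with the simulated value of $\mathrm{MSE}/\epsilon^2$.

```python
import random

def tail_frequency_and_bound(theta, n, eps, replications, seed=1):
    """Simulate the sample mean of n draws from N(theta, 1) and return
    (observed frequency of |mean - theta| > eps, empirical MSE / eps^2)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(replications):
        xbar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        errors.append(abs(xbar - theta))
    frequency = sum(e > eps for e in errors) / replications
    empirical_mse = sum(e ** 2 for e in errors) / replications
    return frequency, empirical_mse / eps ** 2

frequency, bound = tail_frequency_and_bound(theta=0.0, n=25, eps=0.5,
                                            replications=4000)
# Here MSE = 1/25, so the bound is about 0.04 / 0.25 = 0.16.
print(frequency, bound)
```

The observed frequency is far below the bound in this example, which is typical: the inequality holds for every distribution with the given MSE, so it cannot be sharp for a well-behaved one.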


Let $\hat\theta_n = T(X_1, \ldots, X_n)$ be an estimator of a parameter $\theta$ (we write the subscript
$n$ to emphasize the dependence on the sample size). Notice that if $\mathrm{MSE}(\hat\theta_n, \theta)$
converges to zero as $n \to \infty$, then the previous result implies that $\hat\theta_n \overset{p}{\longrightarrow} \theta$. When
an estimator has this last property, we call the estimator consistent.


Definition 3.7 (Consistency)
An estimator $\hat\theta_n$ of $\theta$ constructed on the basis of a sample of size $n$ is called
consistent if $\hat\theta_n \overset{p}{\longrightarrow} \theta$ as $n \to \infty$.


> Remark 3.8 Notice that convergence of the MSE to zero implies consistency. 
The converse is not true in general, though. 


Though we will focus on the mean squared error, it is certainly not the only 
criterion for judging the performance of an estimator: there are many other criteria 
that can be imagined. In general, one can define a loss function, $\mathcal{L} : \Theta \times \Theta \to [0, \infty)$,
which represents the loss incurred when we estimate $\theta$ by $\hat\theta$. Then, one uses the
average loss, or risk, as a measure of performance: $R(\hat\theta, \theta) = \mathbb{E}[\mathcal{L}(\hat\theta, \theta)]$. The
“goodness” or “badness” of an estimator will clearly depend on our choice of loss 
function, and so this choice must be made judiciously. Notice that the mean squared 
error is the risk function obtained when the loss function is defined to be the squared 
Euclidean distance. 
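
As an illustration of how the choice of loss enters, the sketch below approximates the risk of two estimators of a normal mean by simulation. Everything in it is an assumption made for the example: the $N(\theta, 1)$ model, the two candidate estimators (sample mean and sample median), the two losses (squared and absolute error), and the function names.

```python
import random
import statistics

def monte_carlo_risk(estimator, loss, theta, n, replications, seed=5):
    """Approximate the risk E[loss(estimator(sample), theta)] by averaging
    the loss over simulated N(theta, 1) samples of size n."""
    rng = random.Random(seed)
    total_loss = 0.0
    for _ in range(replications):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total_loss += loss(estimator(sample), theta)
    return total_loss / replications

def squared_loss(estimate, theta):
    return (estimate - theta) ** 2

def absolute_loss(estimate, theta):
    return abs(estimate - theta)

theta, n, reps = 1.0, 20, 5000
for name, estimator in (("mean", statistics.fmean), ("median", statistics.median)):
    print(name,
          monte_carlo_risk(estimator, squared_loss, theta, n, reps),
          monte_carlo_risk(estimator, absolute_loss, theta, n, reps))
```

Under squared loss the simulated risk of the sample mean should be close to its MSE, $1/n = 0.05$. Because the same seed is used in every call, both estimators are evaluated on the same simulated samples.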


3.2. Fundamental Limitations to Estimation Accuracy 


We can use the mean squared error to compare any two candidate estimators, and 
so have an idea of their relative performance. It would also be nice to have a 
more absolute benchmark in order to compare the mean squared error of any single
estimator to a best achievable mean squared error for the given problem. It turns out
that this is a difficult problem, because it is equivalent to finding a uniformly optimal
estimator: an estimator $T_*$ such that $\mathrm{MSE}(T_*, \theta) \le \mathrm{MSE}(T, \theta)$ for all $\theta \in \Theta$ and
all candidate estimators $T$. We will not consider this problem here, and will only
remark that in general this problem cannot be solved unless we restrict the class of 
estimators under consideration. 

Instead, we will consider a slightly simpler version of the question posed, namely
the following: for a given bias, can we make the variance of an estimator arbitrarily
small? For example, if the bias is zero, and we have an unbiased estimator, is there
a limit to how small the variance can be? The answer is given in the following
theorem.


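Theorem 3.9 (Cramér-Rao Lower Bound) Let $X_1, \ldots, X_n$ be an iid sample
from a regular parametric model $f(\cdot\,; \theta)$, $\Theta \subseteq \mathbb{R}$. Let $T : \mathcal{X}^n \to \Theta$ be an
estimator of $\theta$, for all $n$. Assume that:

1. $\mathrm{Var}(T) < \infty$, for all $\theta \in \Theta$.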

2. Differentiation with respect to $\theta$ and integration can be interchanged for $T f$, as used in the proof: $\frac{\partial}{\partial\theta} \int T(x) f(x; \theta)\,dx = \int T(x) \frac{\partial}{\partial\theta} f(x; \theta)\,dx$.

3. The same interchange is valid for $f$ itself, as used in the proof: $\frac{\partial}{\partial\theta} \int f(x; \theta)\,dx = \int \frac{\partial}{\partial\theta} f(x; \theta)\,dx$.


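If we denote the bias of $T$ by $\beta(\theta) = \mathbb{E}(T) - \theta$, then it holds that $\beta(\theta)$ is
differentiable, and

\[
\mathrm{Var}(T) \ge \frac{(\beta'(\theta) + 1)^2}{n \int_{\mathcal{X}} \left(\frac{\partial}{\partial\theta} \log f(x; \theta)\right)^2 f(x; \theta)\,dx}
= \frac{(\beta'(\theta) + 1)^2}{n\, \mathbb{E}\left[\left(\frac{\partial}{\partial\theta} \log f(X_1; \theta)\right)^2\right]}.
\]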

Remark 3.10 When X is a discrete random variable, the integrals above will be 
replaced by sums. 


Even if the bias is equal to zero, the variance will still be bounded below
by the inverse of the positive quantity
$n \int_{\mathcal{X}} \left(\frac{\partial}{\partial\theta} \log f(x; \theta)\right)^2 f(x; \theta)\,dx = n\, \mathbb{E}\left[\left(\frac{\partial}{\partial\theta} \log f(X_1; \theta)\right)^2\right] = nI(\theta)$,
and thus so will the MSE. For unbiased estimators,
the variance (and hence the MSE) has the fundamental lower bound $1/(nI(\theta))$. The
quantity $I(\theta)$ is called the Fisher information or simply the information.¹ The
presence of the term $n^{-1}$ on the right-hand side of the Cramér-Rao inequality tells
us that the best achievable variance when the sample size is $n$ is of the order $n^{-1}$.

The good news is the following: if we are interested in looking only for unbiased
estimators, and we find an unbiased estimator with variance $(nI(\theta))^{-1}$, then we
know that we’ve found the best possible unbiased estimator in terms of MSE,
regardless of the true value of $\theta$.
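
A small worked instance may help fix ideas. The sketch below assumes a Bernoulli($p$) model (chosen here for illustration, with hypothetical function names) and computes the Fisher information directly from its definition as a sum over the two possible outcomes. It then compares the resulting bound for unbiased estimators with the exact variance $p(1-p)/n$ of the sample mean.

```python
def bernoulli_pmf(x, p):
    """Mass function f(x; p) of a Bernoulli(p) variable, x in {0, 1}."""
    return p if x == 1 else 1.0 - p

def bernoulli_score(x, p):
    """Derivative of log f(x; p) with respect to p."""
    return x / p - (1 - x) / (1.0 - p)

def fisher_information(p):
    """I(p) = sum over x of (d/dp log f(x; p))^2 * f(x; p)."""
    return sum(bernoulli_score(x, p) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))

def cramer_rao_bound(p, n, bias_derivative=0.0):
    """Lower bound (beta'(p) + 1)^2 / (n I(p)); unbiased case by default."""
    return (bias_derivative + 1.0) ** 2 / (n * fisher_information(p))

p, n = 0.3, 50
variance_of_sample_mean = p * (1 - p) / n
print(cramer_rao_bound(p, n), variance_of_sample_mean)
```

The two printed numbers coincide up to rounding, so in this model the sample mean is an unbiased estimator whose variance equals $(nI(p))^{-1}$.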


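¹More generally, we may define

\[
I_n(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta} \log f_{X_1, \ldots, X_n}(X_1, \ldots, X_n; \theta)\right)^2\right]
\]

to be the Fisher information of a sample of size $n$. In the case of iid random variables, we have
$I_n(\theta) = nI(\theta)$.

Proof of Theorem 3.9 First we prove the theorem in the special case $n = 1$. Define
the random variable $U_1(\theta) = \frac{\partial}{\partial\theta} \log f(X_1; \theta)$. Since the probability model is
regular, the support of $f$ does not depend on $\theta$. Therefore, by Assumption (3),

\[
\begin{aligned}
\mathbb{E}[U_1(\theta)] &= \int_{\mathcal{X}} \left(\frac{\partial}{\partial\theta} \log f(x; \theta)\right) f(x; \theta)\,dx
= \int_{\mathcal{X}} \frac{\frac{\partial}{\partial\theta} f(x; \theta)}{f(x; \theta)}\, f(x; \theta)\,dx \\
&= \int_{\mathcal{X}} \frac{\partial}{\partial\theta} f(x; \theta)\,dx
= \frac{\partial}{\partial\theta} \int_{\mathcal{X}} f(x; \theta)\,dx = 0.
\end{aligned}
\]

Therefore,

\[
\mathrm{Var}[U_1(\theta)] = \mathbb{E}\big[U_1^2(\theta)\big] = \int_{\mathcal{X}} \left(\frac{\partial}{\partial\theta} \log f(x; \theta)\right)^2 f(x; \theta)\,dx = I(\theta). \tag{3.1}
\]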


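Again, since the support of $f$ does not depend on $\theta$ and using Assumption (2),

\[
\begin{aligned}
\beta'(\theta) &= \frac{\partial}{\partial\theta} \int_{\mathcal{X}} T(x) f(x; \theta)\,dx - 1
= \int_{\mathcal{X}} T(x) \frac{\partial}{\partial\theta} f(x; \theta)\,dx - 1 \\
&= \int_{\mathcal{X}} T(x)\, \frac{\frac{\partial}{\partial\theta} f(x; \theta)}{f(x; \theta)}\, f(x; \theta)\,dx - 1 \\
&= \int_{\mathcal{X}} T(x) \left(\frac{\partial}{\partial\theta} \log f(x; \theta)\right) f(x; \theta)\,dx - 1 \\
&= \mathbb{E}[T\, U_1(\theta)] - 1
= \Big(\mathbb{E}[T\, U_1(\theta)] - \mathbb{E}[T] \times \underbrace{\mathbb{E}[U_1(\theta)]}_{=0}\Big) - 1 \\
&= \mathrm{Cov}[U_1(\theta), T] - 1
\end{aligned}
\]

\[
\implies \mathrm{Cov}[U_1(\theta), T] = \beta'(\theta) + 1.
\]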


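Now, using the correlation inequality² we have

\[
\frac{\big|\mathrm{Cov}[U_1(\theta), T]\big|}{\sqrt{\mathrm{Var}[U_1(\theta)]\, \mathrm{Var}[T]}} \le 1
\implies (\beta'(\theta) + 1)^2 \le \mathrm{Var}[U_1(\theta)]\, \mathrm{Var}[T].
\]

Finally, Eq. (3.1) allows us to conclude that

\[
\mathrm{Var}[T] \ge \frac{(\beta'(\theta) + 1)^2}{\int_{\mathcal{X}} \left(\frac{\partial}{\partial\theta} \log f(x; \theta)\right)^2 f(x; \theta)\,dx},
\]

which proves the theorem when $n = 1$. For a more general $n$, define $U_i(\theta) =
\frac{\partial}{\partial\theta} \log f(X_i; \theta)$ and $U(\theta) = \sum_{i=1}^{n} U_i(\theta)$. Note that the $U_i(\theta)$ are independent and
identically distributed as $U_1(\theta)$. Then, by linearity and independence, respectively,

\[
\mathbb{E}[U(\theta)] = \sum_{i=1}^{n} \mathbb{E}[U_i(\theta)] = n\, \mathbb{E}[U_1(\theta)] = 0,
\]

\[
\mathrm{Var}[U(\theta)] = \sum_{i=1}^{n} \mathrm{Var}[U_i(\theta)] = n\, \mathrm{Var}[U_1(\theta)] = n \int_{\mathcal{X}} \left(\frac{\partial}{\partial\theta} \log f(x; \theta)\right)^2 f(x; \theta)\,dx.
\]

Moreover, arguing exactly as in the case $n = 1$,

\[
\beta'(\theta) = \int_{\mathcal{X}^n} T(x_1, \ldots, x_n) \left(\sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(x_i; \theta)\right) \prod_{i=1}^{n} f(x_i; \theta)\,dx_1 \ldots dx_n - 1
= \mathrm{Cov}[U(\theta), T] - 1.
\]

Applying the correlation inequality to $\mathrm{Cov}[U(\theta), T]$ then gives the result and
completes the proof for general $n \ge 1$. □

²A consequence of the Cauchy-Schwarz inequality.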


Exercise 30 Let $X_1, \ldots, X_n \overset{iid}{\sim} \mathrm{Poisson}(\lambda)$. Show that the estimator $\hat\lambda_n = \bar{X}_n =
\sum_{i=1}^{n} X_i / n$ of $\lambda$ attains the Cramér-Rao lower bound.


Remark 3.11 Condition (3) of the theorem asks that we be able to interchange
integration and differentiation. It can be checked on a case-by-case basis (for given
$f$), or it can be replaced by any sufficient conditions on $T$ and $f(x; \theta)$ allowing
this to be the case. Here are two sets of conditions. Either would suffice for (3) to be
true:

1. If $T$ is such that we can write $f(x; \theta) = \exp\{\eta(\theta) T(x) - d(\theta) + S(x)\}$, with
$\eta(\cdot)$ differentiable on $\Theta$ and with $\eta(\cdot)$ and $d(\cdot)$ functions on $\Theta$ as in Exercise 23
(p. 52). In other words, if we have a one-parameter exponential family and the
statistic $T$ in question is its natural sufficient statistic (Bickel & Doksum [1],
Proposition 3.4.1).

2. For $f(x; \theta)$ a density with $\theta \in \mathbb{R}$, and $T(x)$ being a real function, we have
$\frac{\partial}{\partial\theta} \int_{\mathcal{X}} T(x) f(x; \theta)\,dx = \int_{\mathcal{X}} T(x) \frac{\partial}{\partial\theta} f(x; \theta)\,dx$ for all $\theta \in (a, b)$ if the following
four conditions hold (Durrett [10], [Theorem 9.1]):

(a) $\int_{\mathcal{X}} |T(x)| f(x; \theta)\,dx < \infty$ for all $\theta \in (a, b)$.

(b) For any fixed $x \in \mathcal{X}$, $\frac{\partial}{\partial\theta} f(x; \theta)$ exists and is a continuous function of $\theta \in (a, b)$.

(c) $\int_{\mathcal{X}} T(x) \frac{\partial}{\partial\theta} f(x; \theta)\,dx$ is continuous on $(a, b)$.

(d) $\int_{\mathcal{X}} \int_a^b \left|T(x) \frac{\partial}{\partial\theta} f(x; \theta)\right| d\theta\,dx < \infty$.


3.3. Methods for Constructing Estimators


Now we know a way to judge the quality of an estimator, and in some cases we also 
know what the best quality we can hope for is; but how can we propose candidate 
estimators? Any function from $\mathcal{X}^n \to \Theta$ is an estimator, so there is a bewildering
variety of choice! We need general methods (or principles) that can be applied to 
any model in order to produce a candidate estimator. More ambitiously, we want 
methods that will generally yield reasonable estimators. If we have such methods, 
then we can study the properties of the estimators they induce. 


3.3.1 The Method of Maximum Likelihood 


Perhaps the most important method of point estimation is based on the notion 
of likelihood. We first give its rigorous definition, and then consider its intuitive 
interpretation. 


Definition 3.12 (Likelihood for iid Collections)
Let $X_1, \ldots, X_n$ be a collection of independent and identically distributed random
variables with density (or mass function) $f(x; \theta)$, where $\theta \in \mathbb{R}^p$. The likelihood
of $\theta$ on the basis of $X_1, \ldots, X_n$ is defined as

\[
L(\theta) = \prod_{i=1}^{n} f(X_i; \theta).
\]

That is, the likelihood of $\theta$ is the joint density (or mass function) of the
random variables $X_1, \ldots, X_n$, evaluated at $(X_1, \ldots, X_n)$, but seen as a function of
$\theta$. Notice that the likelihood function is a random function, since it depends on the
random sample $X_1, \ldots, X_n$. Strictly speaking, we should write $L_n(\theta)$ to denote the
likelihood, in order to stress the fact that it depends on the sample size. Nevertheless,
we will suppress the $n$ index in general to simplify notation, with the exception of
occasions where it is necessary for clarity.

The interpretation of the likelihood is easiest in the discrete case. In this case, the
likelihood of $\theta$ is the probability of the observed sample $(X_1, \ldots, X_n)$, viewed as a
function of $\theta$. In other words, in the discrete case, the likelihood $L(\theta)$ is the answer
to the question: what is the probability of our observed sample when the parameter
is taken to be equal to $\theta$?³ When $\theta$ is unknown, it would seem that its most suitable
estimate would be a value $\hat\theta$ that makes what we observed most probable, that is, a value
that is most compatible with our empirical observation. This motivates the definition
of a maximum likelihood estimator.

³In the continuous case, a similar interpretation is feasible by considering a small neighbourhood
around our sample: since $F(x + \epsilon/2; \theta) - F(x - \epsilon/2; \theta) \approx \epsilon f(x; \theta)$ as $\epsilon \downarrow 0$, we can think of
$\epsilon^n L(\theta)$ as the approximate probability of a square neighbourhood of edge length $\epsilon$ centred around
our sample, and viewed as a function of $\theta$.


Definition 3.13 (Maximum Likelihood Estimator)
Let $X_1, \ldots, X_n$ be an iid random sample from a distribution $F_\theta$ with density (or
mass function) $f(x; \theta)$. Let $\hat\theta$ be such that

\[
L(\theta) \le L(\hat\theta), \quad \forall\, \theta \in \Theta.
\]

Then $\hat\theta$ is called a maximum likelihood estimator (MLE) of $\theta$.


When there exists a unique maximum of the likelihood function, we speak of
the maximum likelihood estimator $\hat\theta = \arg\max_{\theta \in \Theta} L(\theta)$. When the likelihood is a
differentiable function of $\theta$, we may determine the maximum likelihood estimator
using differential calculus. A maximum of the function $L(\theta)$ must be a root of the
equation

\[
\nabla_\theta L(\theta) = 0
\]

and so solving this equation will provide us with a candidate MLE. Before we
declare a root $\hat\theta$ of this equation to actually be an MLE, we will first need to verify
that this is indeed a maximum (and not a minimum! See Exercise 32, p. 74). If the
likelihood is twice differentiable, this can be done by verifying that

\[
-\nabla_\theta^2 L(\theta)\big|_{\theta = \hat\theta} \succ 0,
\]

i.e. that minus the Hessian matrix is positive definite. When the parameter is one
dimensional, this reduces to verifying that the second derivative is negative when
evaluated at the root of the likelihood equation.

Notice that solving $\nabla_\theta L(\theta) = 0$ will involve the determination of the derivative
of a product of $n$ functions, which is a tedious calculation. To avoid this, we focus
on maximising the loglikelihood $\ell(\theta) := \log L(\theta)$ instead of the likelihood itself.
Since the logarithm is strictly increasing, the likelihood and the loglikelihood attain
their maxima and minima at precisely the same points. But the advantage of the loglikelihood
is that it is a sum rather than a product of $n$ functions, making calculations
straightforward:

\[
\ell(\theta) = \log\left(\prod_{i=1}^{n} f(X_i; \theta)\right) = \sum_{i=1}^{n} \log f(X_i; \theta).
\]

Again, if the loglikelihood function is twice differentiable, an MLE $\hat\theta$ of $\theta$ will
satisfy

\[
\nabla_\theta \ell(\theta)\big|_{\theta = \hat\theta} = 0 \quad \& \quad -\nabla_\theta^2 \ell(\theta)\big|_{\theta = \hat\theta} \succ 0.
\]
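
Besides simplifying the calculus, working with the loglikelihood also has a practical numerical benefit. The sketch below is an illustration under assumed settings (an Exp(1) sample, evaluated at $\lambda = 1$, with hypothetical function names): for a large sample the product of densities underflows to zero in floating-point arithmetic, while the sum of log densities remains perfectly usable.

```python
import math
import random

def exponential_likelihood(sample, lam):
    """Product of the Exp(lam) densities lam * exp(-lam * x) over the sample."""
    product = 1.0
    for x in sample:
        product *= lam * math.exp(-lam * x)
    return product

def exponential_loglikelihood(sample, lam):
    """Sum of the log densities log(lam) - lam * x over the sample."""
    return sum(math.log(lam) - lam * x for x in sample)

rng = random.Random(0)
small_sample = [rng.expovariate(1.0) for _ in range(5)]
large_sample = [rng.expovariate(1.0) for _ in range(2000)]

# For a small sample, the log of the product agrees with the sum of logs.
print(math.log(exponential_likelihood(small_sample, 1.0)),
      exponential_loglikelihood(small_sample, 1.0))

# For a large sample, the product underflows while the sum stays finite.
print(exponential_likelihood(large_sample, 1.0),
      exponential_loglikelihood(large_sample, 1.0))
```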


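Example 3.14 (Bernoulli MLE)

Let $X_1, \ldots, X_n \overset{iid}{\sim} \mathrm{Bern}(p)$ and suppose we wish to use the maximum likelihood method to
construct an estimator for $p \in (0, 1)$. The likelihood is

\[
L(p) = \prod_{i=1}^{n} f(X_i; p) = \prod_{i=1}^{n} p^{X_i} (1 - p)^{1 - X_i} = p^{\sum_{i=1}^{n} X_i} (1 - p)^{n - \sum_{i=1}^{n} X_i}.
\]

Taking logarithms on both sides, we obtain the log likelihood function

\[
\ell(p) = \log p \sum_{i=1}^{n} X_i + \log(1 - p) \left(n - \sum_{i=1}^{n} X_i\right).
\]

We notice that this function is indeed twice differentiable with respect to $p$, and calculate

\[
\frac{d}{dp} \ell(p) = p^{-1} \sum_{i=1}^{n} X_i - (1 - p)^{-1} \left(n - \sum_{i=1}^{n} X_i\right).
\]

Solving for $\ell'(p) = 0$ with respect to $p$ is equivalent to solving

\[
p^{-1} \sum_{i=1}^{n} X_i - (1 - p)^{-1} \left(n - \sum_{i=1}^{n} X_i\right) = 0,
\]

which can be seen to have the unique root $\frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}$. Call this $\hat{p}$. It is our candidate for an
MLE, provided that it yields a maximum. Notice that

\[
\frac{d^2}{dp^2} \ell(p) = -p^{-2} \sum_{i=1}^{n} X_i - (1 - p)^{-2} \left(n - \sum_{i=1}^{n} X_i\right),
\]

which is always non-positive because $0 \le \sum_{i=1}^{n} X_i \le n$ almost surely and $p \in (0, 1)$. Hence
$\hat{p} = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ is the unique MLE of $p$. □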
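
The closed form can be cross-checked by brute force. The sketch below is illustrative only: the simulated data, the grid resolution, and the function name are assumptions made for the example. It evaluates the Bernoulli loglikelihood on a fine grid of values of $p$ and compares the grid maximiser with $\bar{X}$.

```python
import math
import random

def bernoulli_loglikelihood(p, successes, n):
    """l(p) = log(p) * sum(X_i) + log(1 - p) * (n - sum(X_i))."""
    return successes * math.log(p) + (n - successes) * math.log(1.0 - p)

rng = random.Random(3)
n = 200
sample = [1 if rng.random() < 0.35 else 0 for _ in range(n)]
successes = sum(sample)

# Maximise over a fine grid of candidate values in the open interval (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_loglikelihood(p, successes, n))
p_closed_form = successes / n
print(p_grid, p_closed_form)
```

The grid is restricted to the open interval because the loglikelihood involves $\log p$ and $\log(1 - p)$; a sample consisting only of zeros or only of ones would push the maximiser to the boundary, outside the parameter space $(0, 1)$.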


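Example 3.15 (Exponential MLE)

Let $X_1, \ldots, X_n \overset{iid}{\sim} \mathrm{Exp}(\lambda)$ and suppose we wish to use the maximum likelihood method to
construct an estimator for $\lambda \in (0, \infty)$. The likelihood is

\[
L(\lambda) = \prod_{i=1}^{n} f(X_i; \lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda X_i} = \lambda^n \exp\left(-\lambda \sum_{i=1}^{n} X_i\right).
\]

Taking logarithms on both sides, we obtain the log likelihood function

\[
\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} X_i.
\]

We notice that this function is indeed twice differentiable with respect to $\lambda$, and calculate

\[
\frac{d}{d\lambda} \ell(\lambda) = n \lambda^{-1} - \sum_{i=1}^{n} X_i.
\]

Solving for $\ell'(\lambda) = 0$ with respect to $\lambda$ yields the unique root $\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right)^{-1} = 1/\bar{X}$. Call this
$\hat\lambda$. It is our candidate for an MLE, provided that it yields a maximum. Notice that

\[
\frac{d^2}{d\lambda^2} \ell(\lambda) = -\frac{n}{\lambda^2},
\]

which is always negative because $\lambda > 0$. Hence $\hat\lambda = \left(\frac{1}{n} \sum_{i=1}^{n} X_i\right)^{-1} = 1/\bar{X}$ is the unique MLE
of $\lambda$. □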
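
When the likelihood equation has no closed-form solution, one typically solves it iteratively. As a sketch of that idea on a case where the answer is known, the code below applies Newton-Raphson iterations to the score $\ell'(\lambda) = n/\lambda - \sum X_i$ of this example. The starting value $1/\max_i X_i$, the tolerance, and the function name are choices made for this illustration, not prescriptions from the text.

```python
import random

def exponential_mle_newton(sample, tol=1e-12, max_iter=100):
    """Solve l'(lam) = n / lam - sum(x) = 0 by Newton-Raphson steps
    lam <- lam - l'(lam) / l''(lam), where l''(lam) = -n / lam^2."""
    n = len(sample)
    total = sum(sample)
    # Starting value (an assumption of this sketch): 1 / max(x) <= 1 / mean(x),
    # from which the iteration increases towards the root.
    lam = 1.0 / max(sample)
    for _ in range(max_iter):
        score = n / lam - total
        second_derivative = -n / lam ** 2
        step = score / second_derivative
        lam -= step
        if abs(step) < tol:
            break
    return lam

rng = random.Random(8)
sample = [rng.expovariate(2.0) for _ in range(100)]
lambda_newton = exponential_mle_newton(sample)
lambda_closed_form = len(sample) / sum(sample)   # 1 / sample mean
print(lambda_newton, lambda_closed_form)
```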


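Example 3.16 (Gaussian MLE)

Let $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$ and suppose we wish to use the maximum likelihood method to
construct an estimator for $\theta = (\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)$. The likelihood is

\[
L(\mu, \sigma^2) = \prod_{i=1}^{n} f(X_i; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(X_i - \mu)^2}{2\sigma^2}\right\}
= \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} \exp\left\{-\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{2\sigma^2}\right\}.
\]

Taking logarithms on both sides, we obtain the log likelihood function

\[
\ell(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2.
\]

We notice that all second order derivatives with respect to $\mu$ and $\sigma^2$ exist, and calculate

\[
\frac{\partial}{\partial\mu} \ell(\mu, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \mu),
\qquad
\frac{\partial}{\partial\sigma^2} \ell(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (X_i - \mu)^2.
\]

Solving for $\nabla_{(\mu, \sigma^2)} \ell(\mu, \sigma^2) = 0$ with respect to $(\mu, \sigma^2)$ yields a system of two equations in two
unknowns. The unique root of this system can be seen to be $\left(\bar{X}, \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2\right)$. Call this
$(\hat\mu, \hat\sigma^2)$. It is our candidate for an MLE, provided that it yields a maximum. Notice that

\[
\frac{\partial^2}{\partial\mu^2} \ell(\mu, \sigma^2) = -\frac{n}{\sigma^2},
\qquad
\frac{\partial^2}{\partial(\sigma^2)^2} \ell(\mu, \sigma^2) = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6} \sum_{i=1}^{n} (X_i - \mu)^2,
\]

\[
\frac{\partial^2}{\partial\mu\, \partial\sigma^2} \ell(\mu, \sigma^2) = \frac{\partial^2}{\partial\sigma^2\, \partial\mu} \ell(\mu, \sigma^2) = -\frac{\sum_{i=1}^{n} (X_i - \mu)}{\sigma^4} = \frac{n\mu - n\bar{X}}{\sigma^4}.
\]

Evaluating these second derivatives at $(\hat\mu, \hat\sigma^2)$ yields

\[
\frac{\partial^2}{\partial\mu^2} \ell(\mu, \sigma^2)\Big|_{(\mu, \sigma^2) = (\hat\mu, \hat\sigma^2)} = -\frac{n}{\hat\sigma^2},
\qquad
\frac{\partial^2}{\partial(\sigma^2)^2} \ell(\mu, \sigma^2)\Big|_{(\mu, \sigma^2) = (\hat\mu, \hat\sigma^2)} = \frac{n}{2\hat\sigma^4} - \frac{n\hat\sigma^2}{\hat\sigma^6} = -\frac{n}{2\hat\sigma^4},
\]

\[
\frac{\partial^2}{\partial\mu\, \partial\sigma^2} \ell(\mu, \sigma^2)\Big|_{(\mu, \sigma^2) = (\hat\mu, \hat\sigma^2)} = \frac{\partial^2}{\partial\sigma^2\, \partial\mu} \ell(\mu, \sigma^2)\Big|_{(\mu, \sigma^2) = (\hat\mu, \hat\sigma^2)} = \frac{n\hat\mu - n\bar{X}}{\hat\sigma^4} = 0.
\]

We conclude that the matrix $\left[-\nabla^2_{(\mu, \sigma^2)} \ell(\mu, \sigma^2)\right]\Big|_{(\mu, \sigma^2) = (\hat\mu, \hat\sigma^2)}$ is diagonal. To show that it is
positive definite, it suffices to show that its two diagonal elements are positive, which is true since
$\hat\sigma^2$ is positive with probability one. Therefore, the unique MLE of $(\mu, \sigma^2)$ is given by

\[
(\hat\mu, \hat\sigma^2) = \left(\bar{X}, \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2\right).
\]

□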
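
The two first-order conditions can be verified numerically on simulated data. In the sketch below (the data-generating values and function names are assumptions made for the example) the score vector is evaluated at $(\hat\mu, \hat\sigma^2)$, and the loglikelihood there is compared with its value at a few perturbed parameter values.

```python
import math
import random

def gaussian_mle(sample):
    """Return (sample mean, average squared deviation from the sample mean)."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n
    return mu_hat, sigma2_hat

def gaussian_loglikelihood(sample, mu, sigma2):
    n = len(sample)
    ss = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)

def gaussian_score(sample, mu, sigma2):
    """Partial derivatives of the loglikelihood with respect to mu and sigma^2."""
    n = len(sample)
    ss = sum((x - mu) ** 2 for x in sample)
    d_mu = sum(x - mu for x in sample) / sigma2
    d_sigma2 = -n / (2 * sigma2) + ss / (2 * sigma2 ** 2)
    return d_mu, d_sigma2

rng = random.Random(42)
sample = [rng.gauss(1.0, 2.0) for _ in range(500)]
mu_hat, sigma2_hat = gaussian_mle(sample)
score_at_mle = gaussian_score(sample, mu_hat, sigma2_hat)
best = gaussian_loglikelihood(sample, mu_hat, sigma2_hat)
drops = [best - gaussian_loglikelihood(sample, mu_hat + shift_mu, sigma2_hat + shift_s2)
         for shift_mu, shift_s2 in ((0.1, 0.0), (-0.1, 0.0), (0.0, 0.3), (0.0, -0.3))]
print(score_at_mle, drops)
```

The score is zero up to rounding error, and every perturbation lowers the loglikelihood, as expected at a maximum.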


There are situations where we might not be interested in estimating $\theta$ itself, but
rather some transformation $\phi = g(\theta)$. If $g$ is a bijection, we do not need to repeat
the entire estimation process, since the maxima of a function are equivariant to a
reparametrisation of its domain.


Proposition 3.17 (Bijective Equivariance of the MLE) Let $\{f(\cdot\,; \theta) : \theta \in \Theta\}$
be a parametric model, where $\Theta \subseteq \mathbb{R}^p$. Suppose that $\hat\theta$ is an MLE of $\theta$, on the
basis of a random sample $X_1, \ldots, X_n$ from $f(x; \theta)$. Let $g : \Theta \to \Phi \subseteq \mathbb{R}^p$ be a
bijection. Then, $\hat\phi = g(\hat\theta)$ is an MLE of $\phi = g(\theta)$.


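Proof Define $h(x; \phi) = f(x; g^{-1}(\phi))$, and note that $h$ is a well-defined function,
because $g^{-1} : \Phi \to \Theta$ is well defined. The function $h(x; \phi)$ is simply the
density/frequency of $X_i$ under the parametrisation given by $\phi \in \Phi$. An MLE of
$\phi$, say $\hat\phi$, must satisfy

\[
\prod_{i=1}^{n} h(X_i; \phi) \le \prod_{i=1}^{n} h(X_i; \hat\phi), \quad \forall\, \phi \in \Phi.
\]

Let $\hat\theta$ be an MLE of $\theta$, and let $\hat\phi = g(\hat\theta)$. Let $\phi \in \Phi$ be arbitrary and observe that

\[
\begin{aligned}
\prod_{i=1}^{n} h(X_i; \phi) = \prod_{i=1}^{n} f(X_i; g^{-1}(\phi)) &\le \prod_{i=1}^{n} f(X_i; \hat\theta) \\
&= \prod_{i=1}^{n} f(X_i; g^{-1}(\hat\phi)) \\
&= \prod_{i=1}^{n} h(X_i; \hat\phi),
\end{aligned}
\]

which proves the proposition. □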


Example 3.18

Let $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, 1)$, and suppose we are interested in estimating the probability $\mathbb{P}[X_1 \le x]$,
for a given $x \in \mathbb{R}$. We note that

\[
\mathbb{P}[X_1 \le x] = \mathbb{P}[X_1 - \mu \le x - \mu] = \Phi(x - \mu),
\]

where $\Phi$ is the standard normal CDF (see Lemma 1.32, p. 22). But the mapping $\mu \mapsto \Phi(x - \mu)$ is
a bijection onto its image because $\Phi$ is strictly monotone; thus, the MLE of $\mathbb{P}[X_1 \le x]$ is $\Phi(x - \hat\mu)$, where $\hat\mu$ is the MLE
of $\mu$ (which from our previous example is $\hat\mu = \bar{X}$). □
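
In code, the equivariance property amounts to a plug-in computation. The following sketch (function names and simulation settings are assumptions for illustration) evaluates $\Phi(x - \bar{X})$ using the error function from the standard library, via $\Phi(z) = \frac{1}{2}\left(1 + \mathrm{erf}(z/\sqrt{2})\right)$.

```python
import math
import random

def standard_normal_cdf(z):
    """Phi(z) expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mle_of_probability(sample, x):
    """Plug-in MLE of P[X_1 <= x] under N(mu, 1): Phi(x - sample mean)."""
    mu_hat = sum(sample) / len(sample)
    return standard_normal_cdf(x - mu_hat)

rng = random.Random(10)
sample = [rng.gauss(0.5, 1.0) for _ in range(2000)]
estimate = mle_of_probability(sample, x=1.0)
true_value = standard_normal_cdf(1.0 - 0.5)
print(estimate, true_value)
```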


Example 3.19 (Usual vs Natural Parameter in Exponential Families)

Let $X_1, \ldots, X_n \overset{iid}{\sim} f$, with

\[
f(x) = \exp\{\phi T(x) - \gamma(\phi) + S(x)\}, \quad x \in \mathcal{X},
\]

where $\phi \in \Phi \subseteq \mathbb{R}$ is the natural parameter. Now suppose that we can also write $\phi = \eta(\theta)$,
where $\theta \in \Theta$ is the usual parameter, and $\eta : \Theta \to \Phi$ is some differentiable 1-1 mapping (and so
$\gamma(\phi) = \gamma(\eta(\theta)) = d(\theta)$, for $d = \gamma \circ \eta$). In this form, the exponential family density/frequency
will take the form:

\[
\exp\{\phi T(x) - \gamma(\phi) + S(x)\} = \exp\{\eta(\theta) T(x) - d(\theta) + S(x)\}.
\]

Now, Proposition 3.17 (p. 72) implies that if $\hat\theta$ is the MLE of $\theta$, then $\eta(\hat\theta)$ is the MLE of $\phi =
\eta(\theta)$. The converse is also true: if $\hat\phi$ is the unique MLE of $\phi$, then $\eta^{-1}(\hat\phi)$ is the unique MLE of
$\theta = \eta^{-1}(\phi)$. For concrete examples, see Examples 1.24 (p. 18) and 1.26 (p. 18). □


Exercise 31 Let $X_1, \ldots, X_n \overset{iid}{\sim} \mathrm{Exp}(\lambda)$, where $n > 2$, and let $\hat\lambda_n$ be the MLE of
$\lambda$ on the basis of the sample.

1. Show that $\mathbb{E}_\lambda(\hat\lambda_n) = \lambda n/(n - 1)$, and find a new estimator $\hat\lambda_U$ that is unbiased
for $\lambda$. Hint: use the fact that $Z = \sum_{i=1}^{n} X_i \sim \mathrm{Gamma}(n, \lambda)$.

2. Show that $\mathrm{Var}_\lambda(\hat\lambda_n) = n^2 \lambda^2 / ((n - 1)^2 (n - 2))$.

3. Does the estimator $\hat\lambda_U$ attain the Cramér-Rao bound?

4. Determine the MLE $\hat\theta_n$ and the Cramér-Rao bound associated with the parameter
$\theta = 1/\lambda$. Can we use Proposition 3.17?

5. Compare the variance of $\hat\theta_n$ and the obtained Cramér-Rao bound.

There are situations where differential calculus will not be applicable, and other 
approaches will be necessary. This can happen, for example, in models with discrete 
parameter spaces $\Theta$ or in models where the support of $f(\cdot\,; \theta)$ depends on $\theta$. If $\theta$ is
one-dimensional, one can sometimes employ direct inspection in order to determine 
the MLE. 


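Example 3.20

Let $X_1, \ldots, X_n \overset{iid}{\sim} \mathrm{Unif}(0, \theta)$. The likelihood is

\[
L(\theta) = \theta^{-n} \prod_{i=1}^{n} \mathbf{1}\{0 \le X_i \le \theta\} = \theta^{-n}\, \mathbf{1}\{\theta \ge X_{(n)}\}\, \mathbf{1}\{X_{(1)} \ge 0\}.
\]

Hence if $\theta < X_{(n)}$ the likelihood is zero. In the domain $[X_{(n)}, \infty)$, the likelihood is a decreasing
function of $\theta$. Hence $\hat\theta = X_{(n)}$. □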
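
The shape of this likelihood is easy to inspect directly. The sketch below (simulated data and function name are assumptions of the illustration) evaluates $L(\theta)$ at a few values on either side of the sample maximum: it is zero to the left of $X_{(n)}$ and strictly decreasing to the right.

```python
import random

def uniform_likelihood(theta, sample):
    """theta^(-n) * 1{theta >= max(sample)} * 1{min(sample) >= 0}."""
    if theta <= 0 or min(sample) < 0 or max(sample) > theta:
        return 0.0
    return theta ** (-len(sample))

rng = random.Random(7)
true_theta = 3.0
sample = [rng.uniform(0.0, true_theta) for _ in range(20)]
theta_hat = max(sample)   # the MLE: the largest order statistic

for theta in (0.9 * theta_hat, theta_hat, 1.1 * theta_hat, 2.0 * theta_hat):
    print(round(theta, 3), uniform_likelihood(theta, sample))
```

Since every observation lies below the true parameter, the estimate never exceeds it; the MLE in this model systematically sits at or below $\theta$.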


Exercise 32 (Minimum Likelihood) Let $X$ be a discrete random variable taking
the values

0 with probability $6\theta^2 - 4\theta + 1$;
1 with probability $\theta - 2\theta^2$;
2 with probability $3\theta - 4\theta^2$,

where $\theta \in [0, 1/2]$. Determine the maximum likelihood estimator on the basis of a
sample of size 1, $X_1$. What do you observe?


Exercise 33 (Conditional Likelihood) Let $X_1, \ldots, X_n \overset{iid}{\sim} \mathrm{Exp}(\lambda)$, where $\lambda >
0$. How does the MLE of $\lambda$ change if we somehow are told that all $X_i$ overshot
their mean? (In mathematical terms, conditional on the event $\{X_i > \mathbb{E}[X_i],\ i =
1, \ldots, n\}$.) Note that as with Example 3.20, the support of the conditional distribution
depends on the true parameter value $\lambda$.


3.3.2. Maximum Likelihood in Exponential Families


With the exception of the uniform distribution, all the examples of probability 
models we have seen so far on the use of the maximum likelihood method are 
special cases of exponential families. It is then natural to wonder whether some 
general results can be obtained on the use of the method of maximum likelihood in 
an arbitrary parametric model that is a member of the exponential family. 

It was no accident that the MLE existed and was unique in Examples 3.14 (p. 70), 
3.15 (p. 70), and 3.16 (p. 71). This is a general phenomenon for models that are 
exponential families. We will consider here the one-parameter case for simplicity. 


Proposition 3.21 (One-Parameter Exponential Family MLE) Let
$X_1, \ldots, X_n$ be an iid sample from a distribution with density/frequency in
a one-parameter exponential family,

\[
f(x; \phi) = \exp\{\phi T(x) - \gamma(\phi) + S(x)\}, \quad x \in \mathcal{X},\ \phi \in \Phi,
\]

with a parameter space $\Phi \subseteq \mathbb{R}$ that is an open set and $T$ a non-constant function.
If the MLE $\hat\phi$ of $\phi$ exists, then it is unique, and is given by the unique solution to
the equation

\[
\gamma'(u) = \bar{T},
\]

with respect to $u$. Here, $\bar{T} = \frac{1}{n} \sum_{i=1}^{n} T(X_i)$.

Proof The likelihood of $\phi$ on the basis of the sample $X_1, \ldots, X_n$ is

\[
L(\phi) = \prod_{i=1}^{n} \exp\{\phi T(X_i) - \gamma(\phi) + S(X_i)\},
\]

from which we deduce that the loglikelihood is

\[
\ell(\phi) = \log L(\phi) = -n\gamma(\phi) + \sum_{i=1}^{n} S(X_i) + \phi \sum_{i=1}^{n} T(X_i) = -n\gamma(\phi) + \sum_{i=1}^{n} S(X_i) + n\phi\bar{T}.
\]

Since $\gamma(\cdot)$ is twice differentiable, we may also differentiate $\ell$ twice. Doing so, we
obtain that

\[
\ell''(\phi) = -n\gamma''(\phi) = -\mathrm{Var}\left[\sum_{i=1}^{n} T(X_i)\right] < 0,
\]

where the last equality comes from Proposition 2.11 (p. 51). Since the second
derivative is negative for all $\phi$, the function $\ell(\phi)$ is strictly concave. Thus if it attains a
maximum in $\Phi$ this must be the unique maximum, which proves that the MLE is
unique. Since $\Phi$ is open, this maximum $\hat\phi$ of $\ell(\phi)$ must uniquely solve the equation
$\ell'(\phi) = 0$ with respect to $\phi$, or equivalently it must uniquely satisfy

\[
\gamma'(\hat\phi) = \bar{T}.
\]

□
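
Since $\gamma'$ is strictly increasing, the likelihood equation $\gamma'(u) = \bar{T}$ can be solved by any root-bracketing method. The sketch below does so by bisection for one concrete family chosen for illustration: the Bernoulli model in its natural parametrisation, where $T(x) = x$, $\gamma(\phi) = \log(1 + e^{\phi})$ and hence $\gamma'(\phi) = 1/(1 + e^{-\phi})$. The bracket $[-30, 30]$, the tolerance, and the function names are assumptions of the sketch.

```python
import math

def gamma_prime(phi):
    """gamma'(phi) for the Bernoulli family: the mean of T(X) = X under phi."""
    return 1.0 / (1.0 + math.exp(-phi))

def solve_likelihood_equation(t_bar, lo=-30.0, hi=30.0, tol=1e-12):
    """Find u with gamma'(u) = t_bar by bisection, using that gamma' is
    strictly increasing. Raises if the bracket does not contain a root."""
    if not (gamma_prime(lo) < t_bar < gamma_prime(hi)):
        raise ValueError("no root in the bracket: the MLE may not exist")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gamma_prime(mid) < t_bar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t_bar = 0.3
phi_hat = solve_likelihood_equation(t_bar)
print(phi_hat, math.log(t_bar / (1.0 - t_bar)))   # the root is the log odds of t_bar
```

The error branch reflects the proviso "if the MLE exists" in the proposition: when $\bar{T}$ equals 0 or 1, no finite $\phi$ solves the equation.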


Remark 3.22 (Usual Parametrisation) If $\phi = \eta(\theta)$ for a bijection $\eta$, then the
MLE of $\theta$ is also unique, if it exists, by Proposition 3.17 (p. 72).


Exercise 34 (Cramér-Rao Bound and Exponential Families) Let $f(x; \theta) =
\exp(\eta(\theta) T(x) - d(\theta) + S(x))$ be a non-degenerate exponential family such that
• The parameter space $\Theta \subseteq \mathbb{R}$ is open;
• $T(X)$ is not a constant function (i.e. $\mathrm{Var}_\theta[T(X)] > 0$) for all $\theta$;
• The function $\eta : \Theta \to \mathbb{R}$ is a twice differentiable injection with non-vanishing
first derivative.
Let $X_1, \ldots, X_n \overset{iid}{\sim} f(x; \theta)$. Suppose that the MLE $\hat\theta_n$ of $\theta$ exists and has finite
variance for all $\theta \in \Theta$. Prove that $\hat\theta_n$ attains the Cramér-Rao bound (for all $\theta \in \Theta$)
if and only if $h(\theta) = d'(\theta)/\eta'(\theta)$ is an affine function ($h(\theta) = \alpha\theta + \beta$ for some
$\alpha, \beta \in \mathbb{R}$).
Hint: at some point in the proof of the Cramér-Rao theorem, we use a certain
inequality. Exercise 23 (p. 52) will be useful to show that $\hat\theta_n$ corresponds to a
maximum.


3.3.3. Large Sample Properties of Maximum Likelihood 
Going back to Example 3.16 (p. 71), we recall that the maximum likelihood
estimator for the parameter $(\mu, \sigma^2)$ of a Gaussian distribution, based on an iid
sample $X_1, \ldots, X_n$, is

\[
(\hat\mu_n, \hat\sigma_n^2) = \left(\bar{X}, \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2\right) = \left(\bar{X}, \frac{n - 1}{n} S^2\right),
\]

where $S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2$. Using Proposition 2.7 (p. 45) and Corollary 2.8
(p. 48) we thus have a complete description of the probabilistic behaviour of these
estimators:

• The MLE of $\mu$, $\hat\mu_n$, is unbiased for all $n$. Its distribution is normal for all $n$,
with variance $\sigma^2/n$. Therefore, the mean squared error is, in fact, exactly $\sigma^2/n$,
regardless of the value of $\mu$.

• The MLE of $\sigma^2$, $\hat\sigma_n^2$, is biased for all $n$. By Corollary 2.8 (p. 48) its bias is equal
to:

\[
\mathrm{bias}(\hat\sigma_n^2, \sigma^2) = \mathbb{E}[\hat\sigma_n^2] - \sigma^2 = \mathbb{E}\left[\frac{n - 1}{n} S^2\right] - \sigma^2 = \frac{n - 1}{n} \sigma^2 - \sigma^2 = -\frac{1}{n} \sigma^2.
\]

Therefore, $\hat\sigma_n^2$ underestimates $\sigma^2$, though asymptotically the bias reduces to zero.
The distribution of $\hat\sigma_n^2$ is the same as that of a chi-square random variable,
multiplied by $\sigma^2/n$. That is:

\[
\frac{n}{\sigma^2}\, \hat\sigma_n^2 \sim \chi^2_{n-1}.
\]

Consequently, the mean squared error of $\hat\sigma_n^2$ is

\[
\mathrm{MSE}(\hat\sigma_n^2, \sigma^2) = \mathrm{bias}^2(\hat\sigma_n^2, \sigma^2) + \mathrm{Var}[\hat\sigma_n^2] = \left(-\frac{\sigma^2}{n}\right)^2 + \frac{2(n - 1)\sigma^4}{n^2} = \frac{(2n - 1)\sigma^4}{n^2}.
\]
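
These exact expressions can be compared with simulation. The sketch below is an illustration under assumed settings ($n = 10$, $\sigma^2 = 1$, mean zero, hypothetical function names): it estimates the bias and the MSE of $\hat\sigma_n^2$ by Monte Carlo and prints them next to $-\sigma^2/n$ and $(2n - 1)\sigma^4/n^2$.

```python
import random

def sigma2_mle(sample):
    """MLE of the variance: average squared deviation from the sample mean."""
    n = len(sample)
    xbar = sum(sample) / n
    return sum((x - xbar) ** 2 for x in sample) / n

def monte_carlo_bias_and_mse(n, sigma2, replications, seed=11):
    """Simulate N(0, sigma2) samples of size n and return the empirical
    bias and empirical MSE of the variance MLE."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    estimates = []
    for _ in range(replications):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        estimates.append(sigma2_mle(sample))
    bias = sum(estimates) / replications - sigma2
    mse = sum((e - sigma2) ** 2 for e in estimates) / replications
    return bias, mse

n, sigma2 = 10, 1.0
bias_mc, mse_mc = monte_carlo_bias_and_mse(n, sigma2, replications=20000)
bias_exact = -sigma2 / n
mse_exact = (2 * n - 1) * sigma2 ** 2 / n ** 2
print(bias_mc, bias_exact)
print(mse_mc, mse_exact)
```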


Exercise 35 Let $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$ where both parameters are unknown
($n > 1$). We can estimate $\sigma^2$ by $S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2$, or by the MLE given
by $\hat\sigma^2 = (n - 1) S^2 / n$.

1. Which of the two estimators is preferable in terms of mean squared error?

2. More generally, consider estimators of the form $aS^2$, where $a \in \mathbb{R}$. Which is the
optimal choice of $a$ in terms of mean squared error?


We can gain a visual understanding of the behaviour of the MLE in the Gaussian 
case by looking at Figs.3.1 (p. 78) and 3.2 (p. 79). These illustrate the sampling 
fluctuations of the MLE around the true parameter value, and how these change as 
the sample size increases. We note that, as n increases, the realisations of the MLE 
concentrate more and more around the true parameter values. This is no accident: 
the mean squared error both for $\hat\mu$ and for $\hat\sigma^2$ is decreasing in $n$, with a limit of 0 as
$n \to \infty$. Therefore, both estimators are consistent (recall Lemma 3.6, p. 63).

The normal case is special in that we can determine the exact sampling distribution
of the maximum likelihood estimator and determine the mean squared error for
every $n$. This gives us all the information we need in terms of the performance of
the estimator.

Unfortunately, we are not always as lucky with models other than the normal
distribution. The exact sampling distribution of the MLE is often not available, nor
is the exact value of the MSE. As we saw in Sect. 2.4, when we cannot determine a
sampling distribution exactly, we need to resort to approximations using the notion
of convergence in distribution. In fact, we saw that for one-parameter exponential
families the approximate distribution of the natural sufficient statistic $\bar{T}_n$ is normal
(Corollary 2.24, p. 56). Since the MLE in a one-parameter exponential family is
given by the solution of an equation involving $\bar{T}_n$ (see Proposition 3.21, p. 75),
one might conjecture that perhaps the asymptotic distribution of the MLE in a
one-parameter exponential family is also normal (because, if the solution of the equation
depends smoothly on $\bar{T}_n$, then the delta method (Theorem 2.27, p. 57) could be
invoked). This is indeed the case.

Fig. 3.1 Illustration of the random fluctuations of the loglikelihood function and its maximum
(the MLE). We consider the estimation of the mean $\mu$ of a normal distribution with a known
variance equal to 1. We generate 25 iid samples of size $n$, say $\{X_{i,1}, \ldots, X_{i,n}\}$, from an $N(\mu, 1)$
where $\mu = 0$, and each time plot the loglikelihood function $\ell_i(\mu) = \ell(\mu; X_{i,1}, \ldots, X_{i,n})$, where
$i = 1, 2, \ldots, 25$, and the corresponding MLE. We do this for four sample sizes: $n = 1$, $n = 20$,
$n = 100$, $n = 400$. We observe how the likelihood functions become gradually more curved as
$n$ increases, and so their maximum fluctuates less and less from replication to replication. We also
notice that the maxima tend to concentrate around the true value of $\mu$ as $n$ increases. The y-axis
values have been removed since they are unimportant in an absolute sense in the determination of
the MLE. (a) Loglikelihood functions for the mean parameter corresponding to 25 replications of
an iid $N(0, 1)$ sample of size 1. (b) Loglikelihood functions for the mean parameter corresponding
to 25 replications of an iid $N(0, 1)$ sample of size 20. (c) Loglikelihood functions for the mean
parameter corresponding to 25 replications of an iid $N(0, 1)$ sample of size 100. (d) Loglikelihood
functions for the mean parameter corresponding to 25 replications of an iid $N(0, 1)$ sample of size
450

[Figure 3.2, panels (a) to (d): vertical axis "logLikelihood"; legend: MLE of $\sigma^2$, true $\sigma^2$.]

Fig. 3.2 Illustration of the random fluctuations of the loglikelihood function and its maximum
(the MLE). We consider the estimation of the variance $\sigma^2$ of a normal distribution with a known
mean equal to 0. We generate 25 iid samples of size $n$, say $\{X_{i,1}, \ldots, X_{i,n}\}$, from an $N(0, \sigma^2)$
where $\sigma^2 = 1$, and each time plot the log likelihood function $\ell_i(\sigma^2) = \ell(\sigma^2; X_{i,1}, \ldots, X_{i,n})$, where
$i = 1, 2, \ldots, 25$, and the corresponding MLE. We do this for four sample sizes: $n = 10$, $n = 50$,
$n = 150$, $n = 450$. We observe how the likelihood functions become gradually more curved as
$n$ increases, and so their maximum fluctuates less and less from replication to replication. We also
notice that the maxima tend to concentrate around the true value of $\sigma^2$ as $n$ increases; in fact, as
$n$ increases, it looks as though the distribution of the maxima is gradually becoming symmetric
around $\sigma^2$. The y-axis values have been removed since they are unimportant in an absolute
sense in the determination of the MLE. (a) Loglikelihood functions for the variance parameter
corresponding to 25 replications of an iid $N(0, 1)$ sample of size 10. (b) Loglikelihood functions
for the variance parameter corresponding to 25 replications of an iid $N(0, 1)$ sample of size 50.
(c) Loglikelihood functions for the variance parameter corresponding to 25 replications of an iid
$N(0, 1)$ sample of size 150. (d) Loglikelihood functions for the variance parameter corresponding
to 25 replications of an iid $N(0, 1)$ sample of size 450


Theorem 3.23 Let $X_1, \ldots, X_n$ be an iid sample from a distribution with density
(or mass function) $f(x; \phi_0)$ which belongs to a non-degenerate one-parameter
exponential family,

\[
f(x; \phi) = \exp\{\phi T(x) - \gamma(\phi) + S(x)\}, \quad x \in \mathcal{X},\ \phi \in \Phi,
\]

such that $T$ is not a constant function. Assume that the parameter space $\Phi \subseteq \mathbb{R}$
is an open set (recall that, among others, this implies that the function $\gamma(\cdot)$ is
twice differentiable). Let $\hat\phi_n$ be the maximum likelihood estimator of $\phi_0$, assumed
to exist. Then,

\[
0 < \frac{1}{\gamma''(\phi_0)} < \infty
\]

and

\[
\sqrt{n}\,(\hat\phi_n - \phi_0) \overset{d}{\longrightarrow} N\left(0, \frac{1}{\gamma''(\phi_0)}\right).
\]


Remark 3.24 (Non-Degeneracy) To say that a distribution is non-degenerate
(as the theorem requires) means that it does not assign probability 1 to a single value
$x \in \mathcal{X}$.
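
Before turning to the proof, the statement of Theorem 3.23 can be explored by simulation. The sketch below is one possible illustration, with all settings assumed for the example: it takes the exponential distribution written in natural form, $f(x; \phi) = \exp\{\phi x + \log(-\phi)\}$ with $\phi = -\lambda < 0$, so that $T(x) = x$, $\gamma(\phi) = -\log(-\phi)$, $\gamma'(\phi) = -1/\phi$ and $\gamma''(\phi) = 1/\phi^2$. The MLE solves $\gamma'(\hat\phi_n) = \bar{X}$, i.e. $\hat\phi_n = -1/\bar{X}$, and the theorem predicts that $\sqrt{n}(\hat\phi_n - \phi_0)$ has variance close to $1/\gamma''(\phi_0) = \lambda^2$ for large $n$.

```python
import math
import random

def natural_parameter_mle(sample):
    """MLE of phi = -lambda for an Exp(lambda) sample: the solution of
    gamma'(phi) = -1 / phi = mean(sample)."""
    return -len(sample) / sum(sample)

def simulate_scaled_errors(lam, n, replications, seed=23):
    """Return sqrt(n) * (phi_hat - phi_0) over simulated Exp(lam) samples."""
    rng = random.Random(seed)
    phi0 = -lam
    scaled = []
    for _ in range(replications):
        sample = [rng.expovariate(lam) for _ in range(n)]
        scaled.append(math.sqrt(n) * (natural_parameter_mle(sample) - phi0))
    return scaled

lam, n = 2.0, 200
scaled = simulate_scaled_errors(lam, n, replications=3000)
mean_scaled = sum(scaled) / len(scaled)
var_scaled = sum((z - mean_scaled) ** 2 for z in scaled) / len(scaled)
asymptotic_variance = lam ** 2   # 1 / gamma''(phi_0) for gamma(phi) = -log(-phi)
print(mean_scaled, var_scaled, asymptotic_variance)
```

For finite $n$ the simulated mean is slightly off zero and the variance slightly above $\lambda^2$; both discrepancies shrink as $n$ grows, in line with the asymptotic nature of the result.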


Proof of Theorem 3.23 Under the conditions of the Theorem, Proposition 3.21
(p. 75) implies that the MLE $\hat\phi_n$ is unique for all $n$. Furthermore, Proposition 2.11
(p. 51) implies that

\[
\gamma''(\phi) = \frac{1}{n}\, \mathrm{Var}\left[\sum_{i=1}^{n} T(X_i)\right] \in [0, \infty),
\]

which proves that $0 \le \gamma''(\phi) < \infty$ for all $\phi \in \Phi$. To prove that $\gamma''(\phi) > 0$ (strict
inequality) we remark that it must be that $\mathrm{Var}(T_i) > 0$, where $T_i = T(X_i)$. Because if $\mathrm{Var}(T_i) = 0$,
then $\mathbb{P}[T_i = \mathbb{E}(T_i)] = 1$ (by Chebyshev’s inequality, Lemma A.4, p. 159) which
means that $X_i$ is almost surely constant or $T(\cdot)$ is a constant function on $\mathcal{X}$. Either
of these contradicts our assumptions (that the exponential family in question is
non-degenerate and that $T$ is non-constant). We thus conclude that $0 < \frac{1}{\gamma''(\phi)} < \infty$
for all $\phi \in \Phi$.

Now, since $\Phi$ is open, the unique maximum $\hat\phi_n$ of $\ell(\phi)$ must uniquely solve the
equation $\ell'(\phi) = 0$ with respect to $\phi$, or equivalently it must uniquely satisfy

\[
\gamma'(\hat\phi_n) = \bar{T}.
\]


Since y’ is continuously differentiable (by assumption (1)) and we’ve shown that 
y” > 0, the inverse function theorem* implies that there exists an open ball of 
radius « > 0 centred at y’(¢o), say Be(y'(bo)) = {y € R: |y — y'(do)| < €}, 
such that g(-) = [y’]~!(-) exists on B-(y’(@o)) and is itself differentiable with a 
continuous first derivative, which is in fact given explicitly by 


1 


F _ n-ly = 2 = 
gy = tT FO) = aaEaiGy = yO)’ 


By convention, we may define g to be zero outside of B.(y'(¢o)). 
Now, Corollary 2.24 (p. 56) implies that? 


Val — y'(o)) & NO. 7"(Go)). 


If we define bn = g(T), then delta method (Theorem 2.27, p. 57) implies 


Vi($n — $0) = Va(g(T) — gy (Go) & N (0, y"(Go) x [g’(y'(bo))]’) - 
But, by the inverse function theorem, since g(y) = [y’}7!(y), 


1 1 
y"(g(y'(go))) vo) 


s'(y) = = g'(y'(¢o)) = 


1 
y"(g(y)) 


and so we conclude that 


Jin — 00) 5 N (0. =) 


To complete the proof, suppose that we can show that 


Vign . dn) > 0. 


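Then, Slutsky's theorem (Theorem 2.26, p. 57) will imply that $\sqrt{n}\,(\hat\phi_n - \phi_0) \xrightarrow{d} N\left(0, 1/\gamma''(\phi_0)\right)$ 
and prove the theorem.⁶ Note, however, that 

$$\bar{T} \in B_\epsilon(\gamma'(\phi_0)) \implies \hat\phi_n = \tilde\phi_n \implies \sqrt{n}\,(\hat\phi_n - \tilde\phi_n) = 0,$$

because $\tilde\phi_n = g(\bar{T}) = \hat\phi_n$ when $\bar{T} \in B_\epsilon(\gamma'(\phi_0))$. Therefore, if $\delta > 0$, 

$$\sqrt{n}\,|\hat\phi_n - \tilde\phi_n| > \delta \implies \bar{T} \notin B_\epsilon(\gamma'(\phi_0)) \implies |\bar{T} - \gamma'(\phi_0)| \ge \epsilon,$$

and consequently, 

$$\mathbb{P}\big[\sqrt{n}\,|\hat\phi_n - \tilde\phi_n| > \delta\big] \le \mathbb{P}\big[|\bar{T} - \gamma'(\phi_0)| \ge \epsilon\big] \xrightarrow{n \to \infty} 0.$$

The convergence to zero follows from the weak law of large numbers.⁷ This proves 
that $\sqrt{n}\,(\hat\phi_n - \tilde\phi_n) \xrightarrow{p} 0$ and completes the proof. □ 

⁴ Recall the inverse function theorem: let $h : \mathbb{R} \to \mathbb{R}$ be continuously differentiable, with 
a non-zero derivative at a point $x_0 \in \mathbb{R}$. Then, there exists an $\epsilon > 0$ such that $h^{-1}$ exists and is 
continuously differentiable on $(h(x_0) - \epsilon, h(x_0) + \epsilon)$, and in fact $(h^{-1})'(y) = [h'(h^{-1}(y))]^{-1}$ for 
$|y - h(x_0)| < \epsilon$. 

⁵ Remember: $\bar{T}$ is the mean of the iid terms $T(X_1),\dots,T(X_n)$, each satisfying $\mathrm{Var}(T(X_i)) = \gamma''(\phi_0) < \infty$ 
and $\mathbb{E}[T(X_i)] = \gamma'(\phi_0)$, so the central limit theorem implies $\sqrt{n}\,(\bar{T} - \gamma'(\phi_0)) \xrightarrow{d} N(0, \gamma''(\phi_0))$. 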


Remark 3.25 (Asymptotic Variance and the Cramér-Rao Bound) The theorem 
can be interpreted as saying that, for large $n$, the MLE $\hat\phi_n$ is approximately 
$N(\phi_0, [n\gamma''(\phi_0)]^{-1})$. We notice that the asymptotic mean of the MLE is equal to the 
true parameter, so that the asymptotic bias is zero. Furthermore, writing 
$\tau(X_1,\dots,X_n) = \sum_{i=1}^n T(X_i)$, whose mean is $n\gamma'(\phi)$, we note that 

$$\mathbb{E}\big[(\ell'(\phi))^2\big] = \mathbb{E}\left[\left(\frac{\partial}{\partial\phi}\big(\phi\,\tau(X_1,\dots,X_n) - n\gamma(\phi)\big)\right)^2\right] = \mathbb{E}\big[(\tau(X_1,\dots,X_n) - n\gamma'(\phi))^2\big] = \mathrm{Var}[\tau(X_1,\dots,X_n)] = n\gamma''(\phi).$$

Now recall the Cramér-Rao lower bound (Theorem 3.9, p. 65) on the variance of 
an estimator. It stated that no unbiased estimator can have variance lower than the 
inverse of the left-hand side of the equation above. But we have just proved that the 
inverse of the right-hand side is the asymptotic variance of the MLE. It follows that, 
at least for large sample size $n$, the maximum likelihood estimator of $\phi$ attains a 
performance which is close to optimal. This explains why the method of maximum 
likelihood is so central to point estimation. 


⁶ To see this, use Slutsky's theorem with $X_n = \sqrt{n}\,(\tilde\phi_n - \phi_0)$, $Y_n = \sqrt{n}\,(\hat\phi_n - \tilde\phi_n)$ and the 
continuous mapping being $(X_n, Y_n) \mapsto X_n + Y_n$. 

⁷ Since $\bar{T}$ is the mean of the iid terms $T(X_1),\dots,T(X_n)$, each satisfying $\mathrm{Var}(T(X_i)) = \gamma''(\phi_0) < \infty$ 
and $\mathbb{E}[T(X_i)] = \gamma'(\phi_0)$, the Law of Large Numbers implies that $\bar{T} \xrightarrow{p} \gamma'(\phi_0)$. 

Corollary 3.26 (Consistency of the MLE in Exponential Families) In the 
same setup and with the same conditions as in Theorem 3.23 (p. 80), we have 

$$\hat\phi_n \xrightarrow{p} \phi_0, \qquad \text{as } n \to \infty.$$

Proof Define $Y_n = n^{-1/2}$, $X_n = \sqrt{n}\,(\hat\phi_n - \phi_0)$ and $g : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ as $g(x, y) = xy$. 
Then, Theorem 3.23 (p. 80) combined with Slutsky's theorem (Theorem 2.26, p. 57) 
imply that 

$$g(X_n, Y_n) = (\hat\phi_n - \phi_0) \xrightarrow{d} 0.$$

Consequently, Lemma 2.20 (p. 55) implies that $(\hat\phi_n - \phi_0) \xrightarrow{p} 0$ and the proof is 
complete. □ 


Notice that $n\gamma''(\phi) = -\ell''(\phi)$ is minus the second derivative of the loglikelihood. 
Even though the loglikelihood is a random function, in the case of an exponential 
family its second derivative is a deterministic function of $\phi$. What is the interpretation 
of this function? Recall that the second derivative of a function at some point 
describes the curvature of the function at that point. Therefore, Theorem 3.23 
(p. 80) tells us that the asymptotic variance of the MLE is related to the curvature of 
the loglikelihood at the true parameter $\phi_0$. A moment of thought should reveal that 
this is quite intuitive: the flatter the loglikelihood is around the true parameter, 
the more "uncertain" its maximum will be, since a small perturbation in the loglikelihood 
(e.g. due to sample variation) will yield a large perturbation of its maximum 
(hence high variance). On the other hand, if the loglikelihood is sharply peaked, 
we expect that the maximum will not change very much when the loglikelihood 
is perturbed (low variance). This phenomenon is clearly visible in Figs. 3.1 (p. 78) 
and 3.2 (p. 79), where we can see that the dispersion of the MLEs reduces as the 
curvature of the loglikelihood increases. 

The asymptotic distribution for the usual parameter $\theta = \eta^{-1}(\phi)$ of an exponential 
family will now follow as a corollary. 


Corollary 3.27 Let $X_1,\dots,X_n$ be an iid sample from a distribution with density 
(or mass function) $f(x;\theta_0)$ which belongs to a non-degenerate one-parameter 
exponential family, 

$$f(x;\theta) = \exp\{\eta(\theta)T(x) - d(\theta) + S(x)\}, \qquad x \in \mathcal{X},\ \theta \in \Theta.$$

Assume that: 

1. The parameter space $\Theta \subseteq \mathbb{R}$ is an open set. 

2. The function $\eta(\cdot)$ is a twice differentiable bijection between $\Theta$ and $\Phi = \eta(\Theta)$ 
with non-vanishing derivative. 

3. The function $T : \mathcal{X} \to \mathbb{R}$ is not a constant. 

Let $\hat\theta_n$ be the maximum likelihood estimator of $\theta_0$, assumed to exist. Then, 

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\left(0, \frac{\eta'(\theta_0)}{d''(\theta_0)\eta'(\theta_0) - d'(\theta_0)\eta''(\theta_0)}\right).$$


Proof Let $\phi = \eta(\theta)$ and $\gamma(\phi) = d(\eta^{-1}(\phi))$. Then, the density/frequency admits 
the form 

$$\exp\{\phi T(x) - \gamma(\phi) + S(x)\}, \qquad x \in \mathcal{X},\ \phi \in \Phi,$$

and the conditions of Theorem 3.23 (p. 80) are all satisfied. Thus, the MLE $\hat\phi_n$ of 
$\phi_0 = \eta(\theta_0)$ is unique, and satisfies 

$$\sqrt{n}\,(\hat\phi_n - \phi_0) \xrightarrow{d} N\left(0, \frac{1}{\gamma''(\phi_0)}\right),$$

where $0 < 1/\gamma''(\phi_0) < \infty$. 
It follows by the injective equivariance of maximum likelihood (Proposition 3.17, 
p. 72; see also Example 3.19, p. 73) that the unique MLE of $\theta$ is $\hat\theta_n = \eta^{-1}(\hat\phi_n)$. 
The inverse function theorem now implies that $(\eta^{-1})'(y)$ exists in a small 
neighbourhood $B_\epsilon(\phi_0)$ of $\phi_0$ and that 

$$(\eta^{-1})'(\phi_0) = [\eta'(\eta^{-1}(\phi_0))]^{-1} = 1/\eta'(\theta_0).$$

Let $\eta^{-1}(\cdot)$ be equal to zero by convention outside of $B_\epsilon(\phi_0)$. Using the delta method 
(Theorem 2.27, p. 57), we obtain 

$$\sqrt{n}\,(\hat\theta_n - \theta_0) = \sqrt{n}\,\big(\eta^{-1}(\hat\phi_n) - \eta^{-1}(\phi_0)\big) \xrightarrow{d} N\left(0, \frac{1}{\gamma''(\phi_0)[\eta'(\theta_0)]^2}\right).$$

Note, however, that under the assumed conditions, we have shown in Exercise 23 
(p. 52) that 

$$\frac{d''(\theta_0)\eta'(\theta_0) - d'(\theta_0)\eta''(\theta_0)}{[\eta'(\theta_0)]^3} = \mathrm{Var}[T(X_i)] = \gamma''(\phi_0) > 0,$$

so that 

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\left(0, \frac{\eta'(\theta_0)}{d''(\theta_0)\eta'(\theta_0) - d'(\theta_0)\eta''(\theta_0)}\right).$$

Remark 3.28 (Asymptotic Variance and the Cramér-Rao Bound, Again) 
Notice that for the usual parametrisation, we also have that the asymptotic mean 
of the MLE is equal to the true parameter, so that the asymptotic bias is again zero. 
Furthermore, if $\phi = \eta(\theta)$ and $\gamma(\phi) = d(\eta^{-1}(\phi))$, we note that 

$$\mathbb{E}\big[(\ell'(\theta))^2\big] = \mathbb{E}\left[\left(\frac{\partial \ell(\phi)}{\partial\phi}\,\frac{\partial \eta(\theta)}{\partial\theta}\right)^2\right] = (\eta'(\theta))^2\,\mathbb{E}\big[(\ell'(\phi))^2\big] = (\eta'(\theta))^2\,\mathrm{Var}[\tau(X_1,\dots,X_n)]$$

$$= n\,(\eta'(\theta))^2\,\frac{d''(\theta)\eta'(\theta) - d'(\theta)\eta''(\theta)}{[\eta'(\theta)]^3} = n\,\frac{d''(\theta)\eta'(\theta) - d'(\theta)\eta''(\theta)}{\eta'(\theta)},$$

where we have used the same calculation as in Remark 3.25 (p. 82), and our result 
from Exercise 23 (p. 52). The inverse of the LHS is the Cramér-Rao lower bound 
(Theorem 3.9, p. 65). The inverse of the RHS is the asymptotic variance of $\hat\theta_n$. We 
thus see that the MLE of $\theta$ attains a performance that is close to optimal for $n$ large. 
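
As an informal numerical check of Corollary 3.27, the following sketch simulates the exponential distribution with rate $\lambda$, which is not an example worked in the text at this point and is chosen here only for convenience. Under the assumption that one writes its density as $\exp\{\eta(\lambda)x - d(\lambda)\}$ with $\eta(\lambda) = -\lambda$ and $d(\lambda) = -\log\lambda$, the formula of the corollary gives asymptotic variance $\eta'/(d''\eta' - d'\eta'') = (-1)/(-1/\lambda^2) = \lambda^2$, and the MLE is $\hat\lambda_n = 1/\bar{X}_n$. The function name, sample size, number of replications and seed are all illustrative choices.

```python
import random
import statistics


def simulate_scaled_mle_errors(rate, n, replications, seed=0):
    """Return sqrt(n) * (rate_hat - rate) for repeated Exp(rate) samples.

    rate_hat = 1 / sample_mean is the maximum likelihood estimator of the rate.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    errors = []
    for _ in range(replications):
        sample = [rng.expovariate(rate) for _ in range(n)]
        rate_hat = 1.0 / statistics.fmean(sample)
        errors.append((n ** 0.5) * (rate_hat - rate))
    return errors


if __name__ == "__main__":
    rate = 2.0
    errors = simulate_scaled_mle_errors(rate, n=400, replications=2000)
    # The empirical variance should be close to rate ** 2 for large n.
    print("empirical variance:", statistics.pvariance(errors))
    print("variance predicted by the corollary:", rate ** 2)
```

The agreement is only approximate: it is affected both by Monte Carlo error and by the fact that the corollary is a statement about the limit $n \to \infty$.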


Remark 3.29 A conclusion similar to that of Theorem 3.23 (p. 80) is in fact valid 
for a much broader class of distributions than just the exponential family. Under 
smoothness conditions on the density/frequency of the model, under analytical 
conditions enabling differentiation under the integral, and if the MLE $\hat\theta_n$ of $\theta$ is 
unique, one can show that 

$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{d} N\left(0, \frac{I(\theta_0)}{J^2(\theta_0)}\right),$$

where $I(\theta_0) = \mathbb{E}[(\ell'(\theta_0))^2]$ is the Fisher information and $J(\theta_0) = -\mathbb{E}[\ell''(\theta_0)]$. In 
fact, when we can differentiate under the integral, it is an easy exercise to show 
that $I(\theta) = J(\theta)$, and so the asymptotic variance becomes $1/I(\theta_0)$, attaining the 
Cramér-Rao bound. 


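Exercise 36 In the context of Corollary 3.27 (p. 83), prove that 

$$\mathbb{E}\left[\frac{\partial}{\partial\theta}\log f(X_1,\dots,X_n;\theta)\right] = 0, \qquad \text{and}$$

$$\mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log f(X_1,\dots,X_n;\theta)\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log f(X_1,\dots,X_n;\theta)\right]. \qquad (3.2)$$

Conclude that 

$$I(\theta) = J(\theta) = \frac{d''(\theta)\eta'(\theta) - d'(\theta)\eta''(\theta)}{\eta'(\theta)}.$$

Exercise 37 Let $f(x;\theta)$ be a regular parametric model (not necessarily an exponential 
family) such that 

$$\mathcal{X} = \{x \in \mathbb{R} : f(x;\theta) > 0\}$$

does not depend on $\theta$, and $f$ is twice differentiable with respect to $\theta$. Let 
$X_1,\dots,X_n$ be iid with density $f(x;\theta)$. Show that the equality in (3.2) is equivalent to a regularity 
condition allowing us to interchange integration and differentiation. Reminder: for 
any $g : \mathbb{R}^n \to \mathbb{R}$, 

$$\mathbb{E}[g(X)] = \int_{\mathcal{X}^n} g(x) f(x;\theta)\,dx \quad \text{when the integral exists } (x = (x_1,\dots,x_n) \in \mathbb{R}^n).$$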


Exercise 38 We now consider two examples that fall outside of the realm of 
exponential families. 

1. Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Unif}(0,\theta)$, where $\theta > 0$. Let $\hat\theta_n$ be the MLE of $\theta$. Find a 
sequence of real numbers $a_n$ such that $a_n(\theta - \hat\theta_n)$ converges in distribution to a 
non-degenerate limit (i.e. not a constant or infinity). 

2. Consider $\hat\lambda_n$, the estimator from Exercise 33, p. 74. Find a sequence of real 
numbers $a_n$ such that $a_n(\hat\lambda_n - \lambda)$ converges in distribution to a non-degenerate 
limit. 

Hint: Show that if $X \sim \mathrm{Exp}(\lambda)$, then $Y = aX \sim \mathrm{Exp}(\lambda/a)$ for $a > 0$, 
then use Exercise 8 (p. 13). 


3.3.4 Other Estimation Methods 


In some situations, the MLE will not be determinable as an explicit function of the 
data. In these cases, one may need to numerically evaluate the MLE. 


Example 3.30 (Cauchy MLE) 

Suppose that $X_1,\dots,X_n$ are iid random variables following the Cauchy distribution with density 
function 

$$f(x;\theta) = \frac{1}{\pi(1 + (x - \theta)^2)}, \qquad x \in \mathbb{R}.$$

The loglikelihood function in this case is 

$$\ell(\theta) = -\sum_{i=1}^{n} \log[1 + (X_i - \theta)^2] - n\log(\pi).$$

This is differentiable, and so if $\hat\theta$ is a maximum of $\ell(\theta)$, it must satisfy $\ell'(\hat\theta) = 0$, or equivalently 

$$\sum_{i=1}^{n} \frac{2(X_i - \hat\theta)}{1 + (X_i - \hat\theta)^2} = 0.$$

The equation above cannot be explicitly solved to readily give us the form of the maximum 
likelihood estimator as an explicit function of the data. Therefore, the estimator remains implicitly 
defined. For any concrete sample $X_1 = x_1,\dots,X_n = x_n$, we will need to solve the equation 
$\sum_{i=1}^{n} \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0$ by some iterative/approximate solution method in order to get the numerical 
value of the maximum likelihood estimate. □ 


There are several numerical methods that one can employ in order to calculate 
the value of the maximum likelihood estimator in a specific sample (that is, in order 
to calculate the estimate). Chief among these are the Newton-Raphson method, the 
method of bisection, the method of gradient descent and the EM-algorithm. Which 
one is most appropriate depends on the specific example. What is common to all 
of them is that they are iterative: they start at a given input value and iterate some 
operation until a convergence criterion is attained. Since the function $\ell'$ might not 
be monotone (and so may have multiple roots), it is important that the starting input 
value $\hat\theta^{(0)}$ be within a reasonable distance of the true maximum (for example, in 
Example 3.30 (p. 86) we have a non-monotone $\ell'$); otherwise, the algorithm may 
converge to a root that does not correspond to the maximum. 


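Example 3.31 (Newton-Raphson Iteration) 

We consider the general idea behind the Newton-Raphson iteration. We wish to solve the equation 
$\ell'(\theta) = 0$, but cannot do so explicitly. Suppose that we somehow have a starting value $\hat\theta^{(0)}$ that is 
close to the true maximum $\hat\theta$. Since $\hat\theta$ is the overall maximum, it satisfies $\ell'(\hat\theta) = 0$. Now assume 
that $\ell$ is smooth enough that we can carry out a Taylor expansion. We will have (Theorem A.1, 
p. 159): 

$$0 = \ell'(\hat\theta) = \ell'(\hat\theta^{(0)}) + (\hat\theta - \hat\theta^{(0)})\,\ell''(\hat\theta^{(0)}) + \tfrac{1}{2}(\hat\theta - \hat\theta^{(0)})^2\,\ell'''(\theta_*),$$

where $\theta_* = \lambda\hat\theta + (1 - \lambda)\hat\theta^{(0)}$ for some $\lambda \in [0, 1]$. Now assuming that $|\hat\theta - \hat\theta^{(0)}|$ is small, the term 
$(\hat\theta - \hat\theta^{(0)})^2$ is negligible relative to the term $(\hat\theta - \hat\theta^{(0)})$. So, provided $\ell'''$ is bounded, we may write 

$$\ell'(\hat\theta^{(0)}) + (\hat\theta - \hat\theta^{(0)})\,\ell''(\hat\theta^{(0)}) \approx 0,$$

which suggests 

$$\hat\theta \approx \hat\theta^{(0)} - \frac{\ell'(\hat\theta^{(0)})}{\ell''(\hat\theta^{(0)})}.$$

Now, the procedure can then be iterated by defining $\hat\theta^{(1)} = \hat\theta^{(0)} - \frac{\ell'(\hat\theta^{(0)})}{\ell''(\hat\theta^{(0)})}$, then $\hat\theta^{(2)} = \hat\theta^{(1)} - \frac{\ell'(\hat\theta^{(1)})}{\ell''(\hat\theta^{(1)})}$, 
and so on. Provided the starting value is close enough to $\hat\theta$, one expects this iteration to converge. Guarantees 
on the convergence, and on how rapidly it occurs, depend on the specific form of $\ell$. □ 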


How can we find a reasonable starting value $\hat\theta^{(0)}$? In some cases, reasonable 
starting values may be found by direct inspection. 

Example 3.32 (Cauchy MLE, Continued) 

Notice that the density $f(x;\theta)$ is symmetric about $\theta$, 

$$f(x;\theta) = \frac{1}{\pi(1 + (x - \theta)^2)}, \qquad x \in \mathbb{R}.$$

A potential starting value for $\theta$ is thus the median of $X_1,\dots,X_n$. This could be employed to 
initialise a Newton-Raphson iteration. □ 
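
The Newton-Raphson iteration for the Cauchy score equation, started at the sample median, can be sketched as follows. This is a minimal illustration rather than a robust solver: the function names, the tolerance, the iteration cap and the data values are all assumptions made for the example, and the iteration may fail to converge from a poor starting value.

```python
import statistics


def cauchy_score(theta, data):
    """First derivative of the Cauchy loglikelihood at theta."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in data)


def cauchy_score_derivative(theta, data):
    """Second derivative of the Cauchy loglikelihood at theta."""
    return sum(
        2.0 * ((x - theta) ** 2 - 1.0) / (1.0 + (x - theta) ** 2) ** 2
        for x in data
    )


def newton_raphson(score, score_derivative, start, tol=1e-12, max_iter=100):
    """Iterate theta <- theta - score(theta) / score_derivative(theta).

    Stops when the step is smaller than tol; raises if that never happens,
    since the iteration is not guaranteed to converge from a poor start.
    """
    theta = start
    for _ in range(max_iter):
        step = score(theta) / score_derivative(theta)
        theta -= step
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton-Raphson did not converge")


if __name__ == "__main__":
    data = [-0.6, -0.1, 0.2, 0.5, 9.0]  # illustrative values, not from the text
    start = statistics.median(data)  # starting value suggested by symmetry
    theta_hat = newton_raphson(
        lambda t: cauchy_score(t, data),
        lambda t: cauchy_score_derivative(t, data),
        start,
    )
    print("starting value (median):", start)
    print("approximate MLE:", theta_hat)
```

A root of the score equation is only a candidate: checking that the second derivative is negative there confirms a local maximum, but not that it is the overall maximum.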


In other cases, things may not be as clear. 


Example 3.33 (Gamma MLE) 

Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Gamma}(r, 1)$ and suppose we wish to estimate the parameter $r$ by the method 
of maximum likelihood. The likelihood is 

$$L(r) = \prod_{i=1}^{n} \frac{X_i^{r-1} e^{-X_i}}{\Gamma(r)},$$

with corresponding loglikelihood 

$$\ell(r) = -n\log\Gamma(r) + (r - 1)\sum_{i=1}^{n}\log X_i - \sum_{i=1}^{n} X_i.$$

Differentiating and setting the derivative of the loglikelihood equal to zero, we find that the MLE $\hat{r}$ must satisfy 

$$\frac{\Gamma'(\hat{r})}{\Gamma(\hat{r})} = \frac{1}{n}\sum_{i=1}^{n}\log X_i.$$

This equation cannot be solved explicitly. Worse still, there is no immediate plausible value for $r$ 
by simple inspection of the form of the density. In this case, we need some other way of determining 
a starting value for a Newton-Raphson iteration. □ 


To address the issue of finding general methods for the determination of starting 
values $\hat\theta^{(0)}$, it is useful to have estimation methods that will yield some explicit 
estimates that could be used to initialise iterative techniques in search of a maximum 
likelihood estimate. These methods do not necessarily need to be as efficient as the 
method of maximum likelihood, but should at least produce estimators that are reasonably 
good. One such widely used method is the method of moments. 


3.3.4.1 The Method of Moments 

Consider first a one-dimensional problem, where $\{f_\theta : \theta \in \Theta\}$ is a one-parameter 
regular model, $\Theta \subseteq \mathbb{R}$, and $X_1,\dots,X_n$ is an iid sample generated by a true parameter 
$\theta \in \Theta$. The method of moments is motivated by the following heuristic. Assuming 
$\mathbb{E}[|X_1|] < \infty$, the law of large numbers tells us that 

$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{p} \mathbb{E}[X_1].$$

But $\mathbb{E}[X_1] = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx$ depends on the unknown parameter $\theta$, so we can 
write that $\mathbb{E}[X_1] = m(\theta)$ for some $m$. Rephrasing, we have 

$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{p} m(\theta)$$

and so, in other words, we expect that, for $n$ large enough, 

$$\frac{1}{n}\sum_{i=1}^{n} X_i \approx m(\theta)$$

for $\theta$ the true parameter. So, if $\hat\theta$ is to be close to $\theta$, we expect that it should satisfy 

$$\frac{1}{n}\sum_{i=1}^{n} X_i \approx m(\hat\theta).$$

This motivates the method of moments: 


Definition 3.34 (Method of Moments Estimator: Single Parameter Case) 

Let $X_1,\dots,X_n$ be an iid random sample from a distribution $F_\theta$ with density (or 
mass function) $f(x;\theta)$. Assume that $\mathbb{E}|X_1| < \infty$ for all $\theta \in \Theta \subseteq \mathbb{R}$. Let $\hat\theta$ be 
such that 

$$\frac{1}{n}\sum_{i=1}^{n} X_i = m(\hat\theta),$$

where 

$$m(\theta) = \int_{-\infty}^{+\infty} x f(x;\theta)\,dx, \qquad \theta \in \Theta.$$

Then $\hat\theta$ is called a Method of Moments (MoM) estimator of $\theta$. 


In other words, the method of moments says that we should equate the theoretical 
first moment with the observed empirical first moment. This will yield an equation 
with the unknown being the parameter we wish to estimate. Solving this equation 
for the unknown will yield an estimator of 6, the Method of Moments estimator. 
The thing to note here is that this equation is typically easier to solve than an MLE 
score equation, because the data have been separated on one side of the equation 
(yielding a single numerical constant given the observed sample) and the function 
of the parameter on the other side. So rather than having an equation of the form 


$$g(X_1,\dots,X_n,\theta) = 0$$

we have the easier problem of the form 

$$g(\theta) = h(X_1,\dots,X_n).$$


Here is an illustration of the technique in a simple example. 


Example 3.35 (Uniform MoM Estimator) 

Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Unif}(0,\theta)$, and suppose that we wish to estimate $\theta > 0$. In this case we 
have a single parameter, so that the MoM estimator of $\theta$, say $\hat\theta$, must be such that 

$$\frac{1}{n}\sum_{i=1}^{n} X_i = m(\hat\theta).$$

In this case, 

$$m(\theta) = \int_{0}^{\theta} \frac{x}{\theta}\,dx = \frac{\theta}{2}.$$

Therefore, the method of moments estimator is 

$$\hat\theta = \frac{2}{n}\sum_{i=1}^{n} X_i.$$
□ 


In case we need to estimate multiple parameters, say $\theta = (\theta_1,\dots,\theta_p)^\top$, then the 
method of moments asks that we equate the first $p$ empirical moments to the first $p$ 
theoretical moments and obtain a system of $p$ equations with the $p$ parameters as 
unknowns. Solving this system will yield an estimator for $\theta$. 

Definition 3.36 (Method of Moments Estimator: Multiparameter Case) 

Let $X_1,\dots,X_n$ be an iid random sample from a distribution $F_\theta$ with density (or 
mass function) $f(x;\theta)$. Assume that $\mathbb{E}|X_1|^p < \infty$ for all $\theta \in \Theta \subseteq \mathbb{R}^p$. Let $\hat\theta$ be 
such that 

$$\frac{1}{n}\sum_{i=1}^{n} X_i^k = m_k(\hat\theta), \qquad k = 1,\dots,p,$$

where 

$$m_k(\theta) = \int_{-\infty}^{+\infty} x^k f(x;\theta)\,dx, \qquad \theta \in \Theta,\ k = 1,\dots,p.$$

Then $\hat\theta$ is called a Method of Moments (MoM) estimator of $\theta$. 


The following example illustrates a two-parameter situation where the Method of 
Maximum likelihood does not yield explicit estimators, but the Method of Moments 
does. 


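Example 3.37 (Gamma MoM Estimator) 

Suppose that $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Gamma}(r,\lambda)$ and we wish to estimate the parameter vector $(r,\lambda)^\top$. 
The first two moment equations are 

$$\frac{1}{n}\sum_{i=1}^{n} X_i = m_1(r,\lambda) \qquad \text{and} \qquad \frac{1}{n}\sum_{i=1}^{n} X_i^2 = m_2(r,\lambda).$$

But we have seen that 

$$m_1(r,\lambda) = r/\lambda \quad \text{and so} \quad m_2(r,\lambda) = \mathbb{E}^2[X_1] + \mathrm{Var}[X_1] = r^2/\lambda^2 + r/\lambda^2 = r(r+1)/\lambda^2.$$

Solving the system of moment equations with respect to the unknown parameters yields the 
estimates 

$$\hat{r} = \frac{n\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \qquad \text{and} \qquad \hat\lambda = \frac{n\bar{X}}{\sum_{i=1}^{n}(X_i - \bar{X})^2}.$$
□ 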
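
The Gamma method of moments estimates, mean squared over empirical variance for $r$ and mean over empirical variance for $\lambda$ (with divisor $n$ in the variance), take only a few lines to compute. The sketch below uses a hypothetical function name and made-up data, and it also checks that the fitted first two moments reproduce the empirical ones.

```python
import statistics


def gamma_method_of_moments(sample):
    """Method of moments estimates (r_hat, lambda_hat) for Gamma(r, lambda).

    Uses r_hat = mean^2 / s2 and lambda_hat = mean / s2, where s2 is the
    empirical variance with divisor n (second moment minus squared mean).
    """
    mean = statistics.fmean(sample)
    s2 = statistics.pvariance(sample, mu=mean)
    if s2 == 0:
        raise ValueError("all observations are equal; no solution exists")
    return mean ** 2 / s2, mean / s2


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 4.0]  # illustrative values, not from the text
    r_hat, lambda_hat = gamma_method_of_moments(sample)
    print("r_hat:", r_hat, "lambda_hat:", lambda_hat)
    # Fitted versus empirical first moment: r / lambda against the sample mean.
    print(r_hat / lambda_hat, statistics.fmean(sample))
    # Fitted versus empirical second moment: r (r + 1) / lambda^2.
    print(r_hat * (r_hat + 1) / lambda_hat ** 2,
          statistics.fmean([x * x for x in sample]))
```

Such explicit estimates could then serve as starting values for an iterative search for the maximum likelihood estimate.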
Exercise 39 Let $X_1,\dots,X_n$ be an iid sample from the density 

$$f(x;\theta) = \begin{cases} 3\theta^3 x^{-4}, & \text{if } x \ge \theta, \\ 0, & \text{otherwise,} \end{cases}$$

where $\theta > 0$. 

1. Find the method of moments estimator $\hat\theta^{MoM}$ of $\theta$. 
2. Find the ML estimator $\hat\theta^{ML}$ of $\theta$. 
3. Show that $\hat\theta^{MoM}$ is unbiased, but $\hat\theta^{ML}$ is biased. 
4. Calculate the mean squared error of $\hat\theta^{MoM}$ and $\hat\theta^{ML}$. Which estimator would one 
prefer? 


A drawback of the Method of Moments is that it is not guaranteed to always 
work. To even be able to define the procedure, we require the existence of a p-th 
absolute moment in order for the method to work in a p-parameter problem. If such 
a moment does not exist, the method fails. 


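Example 3.38 (MoM Failure in Cauchy Case) 

Let $X_1,\dots,X_n$ be iid random variables following the Cauchy distribution with density function 

$$f(x;\theta) = \frac{1}{\pi(1 + (x - \theta)^2)}, \qquad x \in \mathbb{R}.$$

Notice that 

$$m(\theta) = \int_{-\infty}^{+\infty} \frac{x}{\pi(1 + (x - \theta)^2)}\,dx = \int_{-\infty}^{\theta} \frac{x}{\pi(1 + (x - \theta)^2)}\,dx + \int_{\theta}^{+\infty} \frac{x}{\pi(1 + (x - \theta)^2)}\,dx = -\infty + \infty \quad \text{(undefined)}.$$

Therefore, the moment equations are undefined, and no MoM estimator exists. □ 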


In general, when the moment generating function exists, the method of 
moments is well defined, regardless of the dimension of the parameter. Still, there 
can be no guarantee that the system of equations produced will always have a 
solution. We will not further pursue conditions under which this can be ensured. 


3.4 Estimation Methods vs Estimators vs Estimates 


We conclude this chapter with a short remark on terminology, which can sometimes be 
a source of confusion. Specifically, we distinguish between the notions of 
an estimation method, an estimator and an estimate. Here are some points to bear in 
mind: 

1. An estimation method is a general principle or procedure that can be applied in 
any particular parametric model in order to obtain estimators. We saw examples 
of how we can apply the method of maximum likelihood to get estimators of 
parameters in the Bernoulli, exponential, normal and uniform distributions. 

2. It can very well happen that the same estimation method produces different 
estimators when applied to two different parametric models. For example, the 
method of maximum likelihood produces the estimator $\bar{X}$ for the mean of a 
normal distribution, and the estimator $1/\bar{X}$ for the rate parameter of an exponential 
distribution. 

3. It can also happen that two different estimation methods produce the same 
estimator in the same model. For example, the maximum likelihood estimator for 
the mean of a normal distribution coincides with the method of moments estimator 
for the mean of a normal distribution. 

4. An estimate is the specific value that an estimator takes when evaluated on the 
basis of an observed sample. Remember: an estimator is a random variable. The 
realisation of this random variable is called an estimate. 
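
In computational terms, the distinction between an estimator and an estimate mirrors that between a function and its return value. The sketch below is an illustration only: the function name and the two samples are hypothetical, and the rule used is $1/\bar{X}$, the maximum likelihood estimator of the rate of an exponential distribution.

```python
import statistics


def exponential_rate_estimator(sample):
    """Estimator: a rule mapping any sample to a value (here 1 / sample mean)."""
    return 1.0 / statistics.fmean(sample)


if __name__ == "__main__":
    # Two hypothetical observed samples (illustrative values, not from the text).
    first_sample = [0.5, 1.5]
    second_sample = [0.2, 0.3, 0.1]
    # Estimates: the specific numbers the same estimator returns on each sample.
    print(exponential_rate_estimator(first_sample))
    print(exponential_rate_estimator(second_sample))
```

The same estimator yields a different estimate for each observed sample, which is exactly the sense in which an estimator is a random variable and an estimate is its realisation.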


Tests of Hypotheses for Model Parameters 


So far, we have considered the problem of point estimation: given a parametric 
model $\{F_\theta : \theta \in \Theta\}$, and an iid sample $X_1,\dots,X_n$ issued from some specific 
$F_\theta$, estimate the value of $\theta$ that generated the sample. There are many contexts, 
however, where the precise value of the true parameter is not the primary object of 
our interest. Rather, we are more interested in using the sample to ascertain whether 
the true value of the parameter belongs to some specific subset of parameter values 
or not. 


Example 4.1 (Coin Tossing) 

For a simple example, consider a situation where we wish to ascertain whether a coin is fair, or 
is biased. We may flip the coin $n$ times and record the outcome of each coin toss. We then wish 
to use the outcomes in order to decide whether the probability of heads is equal to 1/2 or whether 
it is different from 1/2. We could formalise this problem by saying that we have $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Bern}(p)$ 
and wish to decide whether $p \in \{\tfrac{1}{2}\}$ or $p \in (0,1) \setminus \{\tfrac{1}{2}\}$. □ 


To make things more concrete, suppose that we know that the parameter has to lie 
in one of two sets: either in $\Theta_0$ or in $\Theta_1$, where $\Theta_0 \cap \Theta_1 = \emptyset$. We wish to employ 
the sample $X_1,\dots,X_n$ that we have at our disposal in order to decide which is the 
case. This setup arises very often in the sciences, where there are two competing 
scientific hypotheses: the null hypothesis $H_0$, which states that $\theta \in \Theta_0$, 

$$H_0 : \theta \in \Theta_0$$

and the competing alternative hypothesis $H_1$, which instead postulates that $\theta \in \Theta_1$, 

$$H_1 : \theta \in \Theta_1.$$

Example 4.2 (Search for the Higgs Boson) 

One of the biggest questions in particle physics in the last quarter century was whether or not the 
infamous Higgs boson exists. One way to detect whether this elementary particle indeed 
exists is via its decay into two photons. Using the Standard Model of particle physics, we can 
compute how many such diphoton events would be produced on average if there were no Higgs 
boson. Let's denote this by $b$. Similarly, we can also compute how many extra diphotons would 
be produced on average if the Higgs particle did indeed exist. Let's call this $s$. Observed diphoton 
events are well documented to follow the Poisson distribution with some mean, say $\mu$. Therefore, 
the null hypothesis corresponding to the state of nature if the Higgs boson did not exist would be 

$$H_0 : \mu = b$$

and the competing alternative hypothesis (describing the state of nature if the Higgs boson existed), 

$$H_1 : \mu = b + s.$$
□ 
The statistical problem of hypothesis testing considers how to efficiently employ 
the sample in order to decide between the two competing hypotheses $H_0$ and $H_1$. 
To do this, we must first consider how one can employ the sample to this aim, and 
what sorts of error one can incur as a result. The next section introduces the relevant 
notions. 


4.1 Test Functions and Error Types 


The decision between $H_0$ and $H_1$ is to be made on the basis of the observed sample 
$X_1,\dots,X_n$. A simple way to state this mathematically is via the following definition. 

Definition 4.3 (Test Function) 
A test function $\delta$ is any function $\delta : \mathcal{X}^n \to \{0, 1\}$. 

A test function takes the value '0' when we rule in favour of $H_0$ based on the 
sample, and it takes the value '1' when we rule in favour of $H_1$. A test function will 
typically take the value 0 or 1 depending on whether or not the sample satisfies a 
certain condition. In other words, test functions are usually constructed by 

$$\delta(X_1,\dots,X_n) = \begin{cases} 1, & \text{if } T(X_1,\dots,X_n) \in C, \\ 0, & \text{if } T(X_1,\dots,X_n) \notin C, \end{cases}$$

where $T$ is a statistic called a test statistic and $C$ a set in the range of $T$ called the 
critical region. Notice that in compact notation, we may write 

$$\delta(X_1,\dots,X_n) = \mathbf{1}\{T(X_1,\dots,X_n) \in C\}.$$

Therefore the choice of the test function rests on the choice of $T$ and of $C$. How 
should we make this choice in order to obtain a good test function? Notice that $\delta$ is 
always a Bernoulli random variable, since it takes the values 0 and 1, 

$$\delta = \begin{cases} 1, & \text{with probability } \mathbb{P}[T(X_1,\dots,X_n) \in C], \\ 0, & \text{with probability } \mathbb{P}[T(X_1,\dots,X_n) \notin C]. \end{cases}$$

This Bernoulli variable may give different decisions for different realisations of the 
random sample. Therefore, just as with the problem of point estimation (where we 
needed to choose good estimators), our choice of a test function must be guided by 
a careful consideration of what types of errors one can commit. A good test function 
will then be a $\delta$ whose sampling behaviour fares well relative to these error criteria. 

In hypothesis testing, there are two possible states of nature, and two possible 
decisions that we can make. Therefore, the “error landscape” is described by the 
following table: 


Decision/Truth    H_0             H_1 
0                 No error        Type II error 
1                 Type I error    No error 


When the truth is $H_0 : \theta \in \Theta_0$, we hope that the distribution of $\delta(X_1,\dots,X_n)$ 
will concentrate around the value 0. Conversely, when the truth is $H_1 : \theta \in \Theta_1$, 
we hope that the distribution of $\delta(X_1,\dots,X_n)$ will concentrate around the value 1. 
Therefore, a good decision rule should concentrate around the value $i$, whenever $H_i$ 
is true, for $i \in \{0, 1\}$. So, by a slight abuse of terminology, we can compare decision 
rules $\delta$ by looking at something like their "mean square error", 

$$\mathrm{MSE}(\delta, H_i) = \mathbb{E}_\theta[(\delta - i)^2], \qquad i \in \{0, 1\}.$$

Since $\delta$ is a Bernoulli variable and $i$ takes values in $\{0, 1\}$, we have 

$$\mathrm{MSE}(\delta, H_i) = \begin{cases} \mathbb{P}_\theta[\delta = 1], & \text{if } \theta \in \Theta_0, \\ \mathbb{P}_\theta[\delta = 0], & \text{if } \theta \in \Theta_1. \end{cases}$$


This motivates the following definition. 


Definition 4.4 (Error Probabilities) 

Let $H_0 : \theta \in \Theta_0$ and $H_1 : \theta \in \Theta_1$ be two competing hypotheses. The Type I 
error probability is defined to be the mapping $h : \Theta_0 \to [0, 1]$, 

$$h(\theta) = \mathbb{P}_\theta[\delta = 1], \qquad \theta \in \Theta_0.$$

The Type II error probability is defined to be the mapping $g : \Theta_1 \to [0, 1]$, 

$$g(\theta) = \mathbb{P}_\theta[\delta = 0], \qquad \theta \in \Theta_1.$$

Remark 4.5 That the two error probabilities are functions of $\theta$ simply reflects 
the fact that our error depends on the true state of nature: for some $\theta$ it will be 
easier to distinguish between $\Theta_0$ and $\Theta_1$ than for some others. For example, consider 
$\Theta_0 = (-\infty, b]$ and $\Theta_1 = (b, \infty)$. For a given test function $\delta$, we expect that it will 
be easier to get things right when the true parameter is away from the boundary 
value $b$ than when the true parameter is close to $b$. 

Remark 4.6 (Warning on Error Probabilities) Notice that $h(\theta) \neq 1 - g(\theta)$, 
since the two functions are defined over different domains. It is a common mistake 
to not realise this. 


In order to have a good test function, we must try to choose the test statistic $T$ 
and the critical region $C$ in such a way that the probability of type I error be small 
for all $\theta \in \Theta_0$ and at the same time the probability of type II error be small for all 
values of $\theta \in \Theta_1$. The Neyman-Pearson framework in the next paragraph considers 
how to attack this problem. 


Remark 4.7 (Type I vs Type II Error) It is no coincidence that the two types of 
error are given two different names, and in fact names that suggest that one kind 
of error is of primary importance (type I) and the other is secondary (type II). In 
many practical contexts, the two hypotheses are asymmetric: making one kind of 
error is far more serious than the other type of error. The more serious type of error 
is named the Type I error and the other is the Type II error. Therefore, in all practical 
situations, Ho is chosen to be the hypothesis whose false rejection is more harmful. 


Example 4.8 (Spam Filter) 

Suppose we wish an automatic test function to decide whether a new email is spam or not. The new 
message contains $n$ words $X_1,\dots,X_n$ and we need a test function in order to decide between two 
competing hypotheses: "spam" versus "not spam". Notice that marking a message as spam when 
it is in fact not can have serious consequences (since we will not see it and it could be important). 
Marking a message as "not spam" when in fact it is spam is annoying, but perhaps not as big of a 
problem. In this context, it is reasonable to define "$H_0$ : Message is not spam" and "$H_1$ : Message 
is spam". If we do so, the type I error probability will be precisely the probability of marking a message as spam 
when it is not. 


Exercise 40 Consider the following statistical hypothesis testing scenarios. Write 
down in each case the two competing hypotheses and the two types of errors you 
can make. Based on this, decide which hypothesis should be the null hypothesis Ho 
and which one the alternative hypothesis $H_1$. 

1. You are a physicist working on an experiment to detect dark matter particles. Test 
if your data contains a dark matter signal. 

2. You are trying to decide if you can drive home after attending a wine tasting. Test 
if your blood alcohol level is above the legal limit for driving. 

3. Barack Obama and Mitt Romney were the leading candidates in the 2012 US 
presidential elections. You are the campaign manager for Mr. Obama trying to 
decide how to best allocate his campaign funds. Test whether Obama is leading 
the race in the state of Iowa. How would your test change if you were the 
campaign manager for Mr. Romney? 

4. You are a scientist working at a pharmaceutical company. You have developed a 
new drug for reducing high blood pressure. Test if your drug works as advertised. 


Exercise 41 Let $X_1,\dots,X_n$ be an iid sample from an $N(\mu, 1)$ distribution. We will 
test $H_0 : \mu = 0$ against the alternative $H_1 : \mu \neq 0$ using the test statistic 

$$T_n(X_1,\dots,X_n) = \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,$$

and corresponding test function 

$$\delta(X_1,\dots,X_n) = \begin{cases} 1, & \text{if } |T_n(X_1,\dots,X_n)| \ge Q, \\ 0, & \text{otherwise,} \end{cases}$$

where $Q > 0$. 

1. Find the probability of committing a type I error. 
2. Find the probability of committing a type II error. 
3. How do these vary as we increase $Q$? 


Exercise 42 Let $X_1,\dots,X_n$ be an iid sample from the Bernoulli($p$) distribution, 
with $p \in (0,1)$. We will test the null hypothesis $H_0 : p = \tfrac{1}{2}$ against the alternative 
$H_1 : p \in (0,1) \setminus \{1/2\}$ using the test statistic 

$$T_n(X_1,\dots,X_n) = \bar{X}_n - \frac{1}{2},$$

and the corresponding test function 

$$\delta(X_1,\dots,X_n) = \begin{cases} 1, & \text{if } |T_n(X_1,\dots,X_n)| \ge Q, \\ 0, & \text{otherwise,} \end{cases}$$

where $Q \in (0, \tfrac{1}{2}]$. 

1. Find the probability of committing a type I error. 
2. Find the probability of committing a type II error. 
3. How do these vary as we increase $Q$? 


Exercise 43 (Bonferroni Correction, Multiple Testing) For each $j = 1,\dots,J$, 
let 

$$X_{1j},\dots,X_{nj}$$

be iid Bernoulli random variables with (unknown) success probability $p_j \in (0, 1)$, 
and $n > 1$. Note that the variables are independent for fixed $j$ and varying $i$, but 
could be dependent for fixed $i$ and varying $j$ (e.g. $X_{ij}$ may be a yes/no answer of the $i$th 
individual to the $j$th question of a customer survey). We wish to test the hypothesis 
pair: 

$$H_0 : p_j \ge \tfrac{1}{2}, \quad \forall j = 1,\dots,J$$
$$H_1 : \exists\, j \in \{1,\dots,J\} : p_j < \tfrac{1}{2}$$

(in our example: are customers on average satisfied on all $J$ issues, or do there 
exist issues where customers are dissatisfied, on average?). Construct a test for the 
hypothesis pair that respects a given level $\alpha \in (0, 1)$. 


4.2. The Neyman-Pearson Framework 


Recall the closing of the previous paragraph: we must try to choose the test statistic
$T$ and the critical region $C$ in such a way that the probability of type I error be small
for all $\theta \in \Theta_0$ and at the same time the probability of type II error be small for all
values of $\theta \in \Theta_1$. Is it possible to always make both these probabilities small, for
all $\theta$ in the respective sets $\Theta_0$ and $\Theta_1$?

Unfortunately, the answer is no. Here is why. Let $\delta(X_1,\dots,X_n) =
\mathbf{1}\{T(X_1,\dots,X_n) \in C\}$ be a test function, and suppose that we wish to reduce
its type I error probability,

$$h(\theta) = P_\theta[\delta = 1], \qquad \theta \in \Theta_0,$$

over all $\theta \in \Theta_0$. To do this, we must "reject less often", that is, we must replace $C$
by a set $C_* \subset C$ and obtain the new test function $\delta_* = \mathbf{1}\{T(X_1,\dots,X_n) \in C_*\}$.
Observe that

$$P_\theta[\delta_* = 1] = P_\theta[T(X_1,\dots,X_n) \in C_*] \le P_\theta[T(X_1,\dots,X_n) \in C] = P_\theta[\delta = 1], \qquad \forall\, \theta \in \Theta_0.$$

But now notice that $C_* \subset C \implies C_*^c \supseteq C^c$ and so

$$P_\theta[\delta_* = 0] = P_\theta[T(X_1,\dots,X_n) \in C_*^c] \ge P_\theta[T(X_1,\dots,X_n) \in C^c] = P_\theta[\delta = 0], \qquad \forall\, \theta \in \Theta_1.$$

In other words, by trying to reduce the type I error, we have increased the type II
error! By symmetry, we can also show that a similar attempt to reduce the type II
error would inflate the type I error (for two concrete examples, consider Exercises 41
and 42, p. 99).
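The trade-off can be made concrete numerically. The following is a minimal sketch under assumptions that are ours, not the text's: an iid $N(\mu, 1)$ sample, the simple pair $H_0 : \mu = 0$ vs $H_1 : \mu = \mu_1$ with a hypothetical $\mu_1 > 0$, and the one-sided critical region "sample mean at least $c$". Raising `c` shrinks the critical region, so the type I error falls while the type II error rises.

```python
from statistics import NormalDist


def error_probabilities(c, n, mu1):
    """Errors of the test that rejects H0: mu = 0 when the sample mean is >= c.

    The sample is assumed iid N(mu, 1), so the sample mean is N(mu, 1/n).
    Returns (type I error under mu = 0, type II error under mu = mu1).
    """
    sd = n ** -0.5
    type_1 = 1.0 - NormalDist(0.0, sd).cdf(c)  # P_0[reject H0]
    type_2 = NormalDist(mu1, sd).cdf(c)        # P_mu1[do not reject H0]
    return type_1, type_2


if __name__ == "__main__":
    # Hypothetical values: n = 10 observations, alternative mean mu1 = 1.
    for c in (0.2, 0.4, 0.6, 0.8):
        t1, t2 = error_probabilities(c, n=10, mu1=1.0)
        print(f"c = {c:.1f}: type I = {t1:.4f}, type II = {t2:.4f}")
```

Running the loop shows the two columns moving in opposite directions as `c` grows, which is exactly the tension described above.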

It seems that we cannot insist on simultaneously reducing the two types of errors,
and we need to make some concessions. The fundamental premise of the Neyman-Pearson
Framework is that since type I error is more important, we should first try
to fix the corresponding probability of type I error at some low level. Once this is
fixed, we can then shift focus to getting a low type II error probability. We describe
the framework in the following steps:


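Definition 4.9 (Neyman-Pearson Framework)
Let $H_0 : \theta \in \Theta_0$ and $H_1 : \theta \in \Theta_1$ be two competing hypotheses.

1. Fix an $\alpha \in (0,1)$ and call it the significance level or just level of the test.

2. Consider only test functions $\delta : \mathcal{X}^n \to \{0,1\}$ that respect the level, i.e. test
functions $\delta$ such that

$$\sup_{\theta \in \Theta_0} P_\theta[\delta = 1] \le \alpha.$$

For ease of reference, we call this class $\mathcal{D}(\Theta_0, \alpha)$. In other words,

$$\mathcal{D}(\Theta_0, \alpha) = \left\{ \delta : \mathcal{X}^n \to \{0,1\} \ \middle|\ \sup_{\theta \in \Theta_0} P_\theta[\delta = 1] \le \alpha \right\}.$$

3. Within the class of test functions $\mathcal{D}(\Theta_0, \alpha)$, compare test functions by considering
which has lower type II error probability

$$g(\theta) = P_\theta[\delta = 0], \qquad \theta \in \Theta_1.$$

Equivalently, one can compare test functions by considering which has higher
power

$$\beta(\theta) = 1 - g(\theta) = P_\theta[\delta = 1], \qquad \theta \in \Theta_1.$$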


The intuition behind the Neyman-Pearson reasoning is as follows: we know that
committing a type I error is most harmful. Therefore, we must make it our top
priority to tightly control the probability of type I error. For this reason, we must
only consider test functions whose type I error probability never exceeds some level
$\alpha$ (usually taken to be small, e.g. $\alpha = 0.05$). Given that this restriction is satisfied,
we can then turn to trying to minimise the type II error probability, or equivalently
to maximising the power.


Exercise 44 

1. In the context of Exercise 41, p. 99, find the smallest value of $Q$ for which the
significance level is equal to some $\alpha \in (0,1)$. Evaluate this for $\alpha = 0.05$ and
$n = 10$. Find the supremum (over the parameter space) of the probability of type
II error for that value of $Q$.

2. In the context of Exercise 42, p. 99, suppose that $n = 10$. Find the values of $Q$
for which the significance level is $\alpha = 0.05$. What is different here as opposed
to the first part of the exercise? Why?


4.3. Methods for Constructing Test Functions 


Now we know what a test function is, what sorts of error we can expect to incur, and
what properties test functions should satisfy (as dictated by the Neyman-Pearson
framework). So it's time to turn to the question of finding general methods for
constructing test functions. It turns out that how one constructs a test function can
depend very heavily on the types of hypotheses under consideration. To simplify
things, we will consider only 1-dimensional parameters $\theta$, and hypothesis pairs of
the form:

1. Simple vs Simple ($H_0 : \theta = \theta_0$, $H_1 : \theta = \theta_1$, for some given $\theta_0 \neq \theta_1$).

2. Left Unilateral vs Right Unilateral ($H_0 : \theta \le \theta_0$, $H_1 : \theta > \theta_0$, for some given
$\theta_0$).

3. Right Unilateral vs Left Unilateral ($H_0 : \theta \ge \theta_0$, $H_1 : \theta < \theta_0$, for some given
$\theta_0$).

4. Simple vs Bilateral ($H_0 : \theta = \theta_0$, $H_1 : \theta \neq \theta_0$, for some given $\theta_0$).

In short, we will only consider pairs of the form:

$$\underbrace{\begin{cases} H_0 : \theta = \theta_0 \\ H_1 : \theta = \theta_1 \end{cases}}_{\text{simple vs simple}} \qquad \underbrace{\begin{cases} H_0 : \theta \le \theta_0 \\ H_1 : \theta > \theta_0 \end{cases} \qquad \begin{cases} H_0 : \theta \ge \theta_0 \\ H_1 : \theta < \theta_0 \end{cases}}_{\text{unilateral vs unilateral}} \qquad \underbrace{\begin{cases} H_0 : \theta = \theta_0 \\ H_1 : \theta \neq \theta_0 \end{cases}}_{\text{simple vs bilateral}}$$


While this may seem restrictive, it encompasses a large variety of applied 
situations. In applications, it is often sought to decide between two parameter values, 
or to decide whether a certain parameter is above, below or just deviates from a given 
threshold. 

Now, before we proceed with considering methods for constructing tests, we
recall that in the Neyman-Pearson framework (Definition 4.9, p. 101) we set a level
$\alpha$, and consider only test functions that respect this level. In other words, we restrict
attention to elements in $\mathcal{D}(\Theta_0, \alpha)$. Within this class, we compare test functions
by comparing their corresponding power functions. This motivates the following
definition of optimality:


Definition 4.10 (Optimal Tests) 
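A test function $\delta$ of $H_0 : \theta \in \Theta_0$ vs $H_1 : \theta \in \Theta_1$ is called optimal at level $\alpha$ (or
uniformly most powerful at level $\alpha$) if the following two hold.

1. $\delta \in \mathcal{D}(\Theta_0, \alpha)$.

2. $P_{\theta_1}[\psi = 1] \le P_{\theta_1}[\delta = 1]$ for all $\theta_1 \in \Theta_1$ and all $\psi \in \mathcal{D}(\Theta_0, \alpha)$.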


Therefore, we wish to find methods that yield tests respecting the level, and with
as high power as possible, for as many elements in the alternative set $\Theta_1$ as possible.
As it turns out, sometimes there do exist testing methods that are optimal; when this
is the case, there is no reason to consider any other method. The existence of such
optimal tests, though, depends strongly on the structure of $\Theta_0$ and $\Theta_1$, and also on
the particular probability model under study. We will therefore structure our study of
testing methods according to the types of pairs considered. Here is an overview:

(a) Simple vs Simple: In this case we will be able to find optimal tests that remain 
optimal regardless of the underlying model. 

(b) Unilateral: In this case we will be able to find optimal tests for specific classes 
of models, specifically for the exponential family of distributions. 

(c) Bilateral. In this case we will demonstrate that no optimal tests exist in general. 
We will nevertheless propose two general methods inspired by the concept of 
likelihood, that perform well in general. 


4.3.1 Simple Case 


In the case of a simple vs a simple hypothesis, the following result due to Neyman 
and Pearson gives us a method for constructing optimal tests. 


Lemma 4.11 (Neyman-Pearson) Let $X = (X_1,\dots,X_n)$ have joint density (or
frequency) function $f_X(x;\theta)$ and suppose we wish to test

$$H_0 : \theta = \theta_0 \qquad \text{vs} \qquad H_1 : \theta = \theta_1$$

at some level $\alpha \in (0,1)$, for $\theta_0 \neq \theta_1$. If the random variable

$$\Lambda(X) = \frac{f_X(X_1,\dots,X_n;\theta_1)}{f_X(X_1,\dots,X_n;\theta_0)} = \frac{L(\theta_1)}{L(\theta_0)}$$

is such that there exists a $Q > 0$ satisfying

$$P_{\theta_0}[\Lambda \ge Q] = \alpha$$

then the test whose test function is given by

$$\delta(X) = \mathbf{1}\{\Lambda(X) \ge Q\},$$

is an optimal (most powerful) test of $H_0$ versus $H_1$ at significance level $\alpha$.


Remark 4.12 A sufficient condition for the existence of a suitable $Q$ for any
$\alpha \in (0,1)$ is that $\Lambda$ be a continuous random variable under the null hypothesis.
If the distribution of $\Lambda$ under $H_0$ is discrete or has discontinuities, there may exist
$\alpha \in (0,1)$ such that $P_{\theta_0}[\Lambda \ge Q] = \alpha$ cannot be satisfied for any $Q > 0$.


Notice the intuition behind the test: we know that the method of maximum
likelihood is a very good estimation method. The higher the likelihood of a
parameter, the more plausible this parameter value is as a guess for the true
parameter. So, in order to test $H_0 : \theta = \theta_0$ against $H_1 : \theta = \theta_1$, we decide to
compare the value of the likelihood function at the two competing parameter values
$\theta_0$ and $\theta_1$. If the likelihood of $\theta_1$ is significantly higher than the likelihood of $\theta_0$, then
we reject $H_0$ in favour of $H_1$. How much higher qualifies as significantly higher?
The lemma tells us that $Q$-times higher is significantly higher, where $Q$ is a critical
value chosen so that the level $\alpha$ be respected.


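Proof of Lemma 4.11 We need to verify properties (1) and (2) in Definition 4.10
(p. 103). Since $Q$ is such that $P_{\theta_0}[\Lambda \ge Q] = \alpha$, we immediately have that

$$P_{\theta_0}[\delta = 1] = \alpha \qquad (\text{since } P_{\theta_0}[\delta = 1] = P_{\theta_0}[\Lambda \ge Q]). \qquad (4.1)$$

Therefore $\delta \in \mathcal{D}(\{\theta_0\}, \alpha)$ (i.e. $\delta$ indeed respects the level $\alpha$), which yields (1).

To show (2), let $\psi \in \mathcal{D}(\{\theta_0\}, \alpha)$. For notational ease, write $(X_1,\dots,X_n)^\top = X$
and $(x_1,\dots,x_n)^\top = x$. Without loss of generality assume that $f_X$ is a density
function (otherwise replace any integrals that follow by sums), and observe that

$$f_X(x;\theta_1) - Q\, f_X(x;\theta_0) \ge 0 \ \text{ if } \delta(x) = 1 \qquad \text{and} \qquad f_X(x;\theta_1) - Q\, f_X(x;\theta_0) < 0 \ \text{ if } \delta(x) = 0.$$

Therefore, since $\psi$ can only take the values 0 or 1,

$$\psi(x)\big(f_X(x;\theta_1) - Q\, f_X(x;\theta_0)\big) \le \delta(x)\big(f_X(x;\theta_1) - Q\, f_X(x;\theta_0)\big),$$

$$\int_{\mathcal{X}^n} \psi(x)\big(f_X(x;\theta_1) - Q\, f_X(x;\theta_0)\big)\,dx \le \int_{\mathcal{X}^n} \delta(x)\big(f_X(x;\theta_1) - Q\, f_X(x;\theta_0)\big)\,dx.$$

Rearranging the terms yields

$$\int_{\mathcal{X}^n} (\psi(x) - \delta(x))\, f_X(x;\theta_1)\,dx \le Q \int_{\mathcal{X}^n} (\psi(x) - \delta(x))\, f_X(x;\theta_0)\,dx$$

$$\implies E_{\theta_1}[\psi(X)] - E_{\theta_1}[\delta(X)] \le Q\,\big(E_{\theta_0}[\psi(X)] - E_{\theta_0}[\delta(X)]\big)$$

$$\implies P_{\theta_1}[\psi(X) = 1] - P_{\theta_1}[\delta(X) = 1] \le Q\,\big(P_{\theta_0}[\psi(X) = 1] - P_{\theta_0}[\delta(X) = 1]\big).$$

Equation (4.1), combined with the fact that $\psi \in \mathcal{D}(\{\theta_0\}, \alpha)$ and $Q > 0$, implies that
the right-hand side is non-positive. This proves (2) in Definition 4.10 (p. 103), and
thus completes the proof. $\square$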
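On a finite sample space the conclusion of the lemma can be checked by brute force. The sketch below uses two hypothetical probability mass functions on $\{0,1,2,3\}$ (our own choice, purely for illustration): it forms the likelihood ratio rejection region, and then enumerates every non-randomised test whose size does not exceed that of the likelihood ratio test, confirming that none has greater power. Exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from itertools import combinations


def most_powerful_by_enumeration(p0, p1, alpha):
    """Return (best power, best rejection set) among all tests of size <= alpha.

    p0, p1: dicts mapping each point of a finite sample space to its
    probability under H0 and H1. A non-randomised test is identified
    with its rejection set.
    """
    points = list(p0)
    best_power, best_set = Fraction(0), frozenset()
    for r in range(len(points) + 1):
        for subset in combinations(points, r):
            size = sum((p0[x] for x in subset), Fraction(0))
            power = sum((p1[x] for x in subset), Fraction(0))
            if size <= alpha and power > best_power:
                best_power, best_set = power, frozenset(subset)
    return best_power, best_set


def likelihood_ratio_test(p0, p1, q):
    """Rejection set {x : p1(x) / p0(x) >= q}, together with its size and power."""
    region = frozenset(x for x in p0 if p1[x] >= q * p0[x])
    size = sum((p0[x] for x in region), Fraction(0))
    power = sum((p1[x] for x in region), Fraction(0))
    return region, size, power


# Hypothetical pmfs under H0 and H1 (illustrative values only).
P0 = {0: Fraction(4, 10), 1: Fraction(3, 10), 2: Fraction(2, 10), 3: Fraction(1, 10)}
P1 = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}

if __name__ == "__main__":
    region, size, power = likelihood_ratio_test(P0, P1, Fraction(1))
    print("LR region:", sorted(region), "size:", size, "power:", power)
    print("best by enumeration:", most_powerful_by_enumeration(P0, P1, size))
```

The enumeration is exponential in the size of the sample space, so it is only a sanity check for toy examples, not a method for constructing tests.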
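Example 4.13
Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Exp}(\lambda)$ and let $\lambda_1 > \lambda_0$ be two constants. Consider the problem of testing the
hypothesis pair:

$$H_0 : \lambda = \lambda_0 \qquad \text{vs} \qquad H_1 : \lambda = \lambda_1.$$

The likelihood is

$$f_X(X_1,\dots,X_n;\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda X_i}.$$

So according to the Neyman-Pearson lemma we must base our test on the statistic

$$\Lambda(X_1,\dots,X_n) = \frac{f_X(X_1,\dots,X_n;\lambda_1)}{f_X(X_1,\dots,X_n;\lambda_0)} = \left(\frac{\lambda_1}{\lambda_0}\right)^{n} \exp\left[(\lambda_0 - \lambda_1)\sum_{i=1}^{n} X_i\right],$$

rejecting the null if $\Lambda \ge Q$, for $Q$ such that $P_{\lambda_0}[\Lambda(X_1,\dots,X_n) \ge Q] = \alpha$, provided such a $Q$
exists. To determine whether it does exist, and if so what it is, we note that $\Lambda(X_1,\dots,X_n)$ is a
decreasing function of $\tau(X_1,\dots,X_n) = \sum_{i=1}^{n} X_i$ (since $\lambda_0 < \lambda_1$). Therefore

$$\Lambda(X_1,\dots,X_n) \ge Q \iff \tau(X_1,\dots,X_n) \le q$$

for some $q$, such that

$$\alpha = P_{\lambda_0}[\Lambda \ge Q] \iff \alpha = P_{\lambda_0}[\tau(X_1,\dots,X_n) \le q].$$

Now, under the null hypothesis, $\tau(X_1,\dots,X_n)$ has a gamma distribution with parameters $n$ and $\lambda_0$
(see p. 13). Hence, there exists a $q$ such that $\alpha = P_{\lambda_0}[\tau(X_1,\dots,X_n) \le q]$, and this $q$ is given by
the $\alpha$-quantile of the gamma($n$, $\lambda_0$) distribution.
In summary, the optimal test is to reject $H_0$ at level $\alpha$ if $\tau(X_1,\dots,X_n)$ is smaller than the
$\alpha$-quantile of a gamma($n$, $\lambda_0$) distribution. $\square$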
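To carry out this test one needs the $\alpha$-quantile of a gamma distribution with integer shape $n$. The sketch below is one possible way to obtain it with the standard library only: it uses the closed-form distribution function of the gamma distribution with integer shape (the Erlang distribution) and inverts it by bisection. The function names and the numerical tolerance are our own choices, and the gamma distribution is parametrised by its rate, as in the example.

```python
import math


def erlang_cdf(t, n, rate):
    """P[tau <= t] for tau ~ gamma(n, rate) with integer shape n."""
    if t <= 0:
        return 0.0
    x = rate * t
    term, partial = 1.0, 1.0  # k = 0 term of sum_{k < n} x^k / k!
    for k in range(1, n):
        term *= x / k
        partial += term
    return 1.0 - math.exp(-x) * partial


def erlang_quantile(p, n, rate, tol=1e-12):
    """Solve erlang_cdf(q, n, rate) = p for q by bisection."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, n, rate) < p:  # grow the bracket until it contains q
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, n, rate) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def reject_h0(sample, rate0, alpha):
    """Reject H0: lambda = rate0 (against a larger rate) when sum(sample) <= q_alpha."""
    q = erlang_quantile(alpha, len(sample), rate0)
    return sum(sample) <= q
```

For instance, `reject_h0(data, 2.0, 0.05)` applies the test with hypothetical values $\lambda_0 = 2$ and $\alpha = 0.05$ to a list `data` of observations. Note that the alternative value $\lambda_1$ never enters the computation, only the fact that $\lambda_1 > \lambda_0$.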
The previous example demonstrated something interesting: the test statistic for an
optimal test reduced to the natural sufficient statistic $\tau$ of the distribution (notice
that the exponential distribution is a one-parameter exponential family with natural
statistic $\tau(x_1,\dots,x_n) = \sum_{i=1}^{n} x_i$). This is not a coincidence. It works the same
way for all one-parameter exponential families:


Example 4.14 (Simple vs Simple Test in Exponential Families) 


Let $X_1,\dots,X_n \overset{iid}{\sim} f(x;\theta)$, where $f(x;\theta) = \exp\{\eta(\theta)T(x) - d(\theta) + S(x)\}$ is a one-parameter
exponential family, with $\eta$ being increasing. Suppose we wish to test $H_0 : \theta = \theta_0$ against
$H_1 : \theta = \theta_1$. Without loss of generality, assume that $\theta_0 < \theta_1$. The Neyman-Pearson Lemma
(Lemma 4.11, p. 103) dictates that we should look for a test function of the form

$$\delta = \mathbf{1}\{L(\theta_1)/L(\theta_0) \ge Q\} = \mathbf{1}\{\log L(\theta_1) - \log L(\theta_0) \ge \log Q\}.$$

By the exponential family form of $f(x;\theta)$, we obtain that

$$\delta = \mathbf{1}\left\{ (\eta(\theta_1) - \eta(\theta_0)) \sum_{i=1}^{n} T(X_i) - n(d(\theta_1) - d(\theta_0)) \ge \log Q \right\}
= \mathbf{1}\left\{ \sum_{i=1}^{n} T(X_i) \ge \frac{\log Q + n(d(\theta_1) - d(\theta_0))}{\eta(\theta_1) - \eta(\theta_0)} \right\}.$$

Notice that $\eta(\theta_1) - \eta(\theta_0) > 0$ since $\eta$ is increasing, and $n(d(\theta_1) - d(\theta_0))$ is just a constant. So we
can just write

$$\delta = \mathbf{1}\{\tau(X_1,\dots,X_n) \ge q\}.$$

If $\tau$ is a continuous random variable, and we want a level $\alpha$ test, then $q$ is going to be the $(1-\alpha)$-quantile
of $G_0(t) = P_{\theta_0}[\tau(X_1,\dots,X_n) \le t]$, i.e. the $(1-\alpha)$-quantile of the sampling distribution
of $\tau(X_1,\dots,X_n)$ when the parameter is taken to be $\theta_0$ (this is called the null distribution of $\tau$).

If, on the other hand, we have that $\eta$ is a decreasing function, then for $\theta_0 < \theta_1$, we have
$\eta(\theta_1) - \eta(\theta_0) < 0$. In this case, we can see that the optimal test function becomes

$$\delta = \mathbf{1}\{\tau(X_1,\dots,X_n) \le q\}.$$

This time, if we want a level $\alpha$ test, then $q$ must be the $\alpha$-quantile of $G_0(t) = P_{\theta_0}[\tau(X_1,\dots,X_n) \le t]$.

We observe that the form of the test depends on whether $\eta$ is increasing or decreasing,
and on whether $\theta_0 < \theta_1$ or $\theta_0 > \theta_1$. The following table summarises the form of the test
function for the different cases. In each case, $q_s$ represents the $s$-quantile of the distribution
$G_0(t) = P_{\theta_0}[\tau(X_1,\dots,X_n) \le t]$.

                            $\theta_0 < \theta_1$                                      $\theta_0 > \theta_1$
$\eta(\cdot)$ increasing    $\mathbf{1}\{\tau(X_1,\dots,X_n) \ge q_{1-\alpha}\}$    $\mathbf{1}\{\tau(X_1,\dots,X_n) \le q_{\alpha}\}$
$\eta(\cdot)$ decreasing    $\mathbf{1}\{\tau(X_1,\dots,X_n) \le q_{\alpha}\}$      $\mathbf{1}\{\tau(X_1,\dots,X_n) \ge q_{1-\alpha}\}$

An interesting observation is that the test function does not depend on the precise value of $\theta_1$, but
only on whether $\theta_1 < \theta_0$ or $\theta_1 > \theta_0$. $\square$
It is not always the case that $G_0(t) = P_{\theta_0}[\tau(X_1,\dots,X_n) \le t]$ is a continuous
distribution. This means that we might not be able to find an optimal test for all $\alpha$
(we will be able to find an optimal test only for some specific $\alpha$). Here is an example.


Example 4.15 
Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Poisson}(\mu)$ and consider the hypothesis pair

$$H_0 : \mu = \mu_0 \qquad \text{vs} \qquad H_1 : \mu = \mu_1.$$

Notice that this is the hypothesis pair we encountered in the Higgs boson example (Example 4.2,
p. 96) if we set $\mu_0 = b$ and $\mu_1 = b + s$. This is a one-parameter exponential family example, and
it is easy to see that the sufficient statistic is

$$\tau(X_1,\dots,X_n) = \sum_{i=1}^{n} X_i$$

and the $\eta(\cdot)$ function is strictly increasing (it is equal to the $\log(\cdot)$ function). Since $\mu_1 > \mu_0$, our
work in Example 4.14 yields the optimal Neyman-Pearson test function for this hypothesis pair as given
by

$$\delta(X_1,\dots,X_n) = \mathbf{1}\left\{ \sum_{i=1}^{n} X_i \ge q_{1-\alpha} \right\},$$

provided there exists a $q_{1-\alpha}$ such that $G_0(q_{1-\alpha}) = P_{\mu_0}[\tau(X_1,\dots,X_n) \le q_{1-\alpha}] = 1 - \alpha$. Since
the $X_i$ are independent and Poisson distributed, it is a simple exercise (using generating functions;
see Lemma A.10, p. 168) to show that $\tau(X_1,\dots,X_n) \sim \mathrm{Poisson}(n\mu_0)$. Since this is a discrete
distribution, the only $\alpha$ for which this will be the case will be

$$e^{-n\mu_0},\quad e^{-n\mu_0}(1 + n\mu_0),\quad e^{-n\mu_0}\left(1 + n\mu_0 + \frac{(n\mu_0)^2}{2}\right),\quad e^{-n\mu_0}\left(1 + n\mu_0 + \frac{(n\mu_0)^2}{2} + \frac{(n\mu_0)^3}{3!}\right),\ \text{etc.}$$

and so on (recall the probability mass function of a Poisson random variable in Definition 1.9, p. 7).
However, an interesting observation is that as $n$ grows, this sequence of values becomes denser and
denser near to the origin. More precisely, for each $\varepsilon > 0$ and $k \in \mathbb{N}$, there exists an $N \in \mathbb{N}$ such
that if $n \ge N$ then there are at least $k$ possible values of $\alpha$ in the interval $[0, \varepsilon]$. $\square$
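The discreteness is easy to see numerically. As a rough sketch (the helper names and the value of $n\mu_0$ are hypothetical choices of ours), the code below evaluates the Poisson distribution function $G_0$ at the non-negative integers and checks whether a prescribed target value is among the values that $G_0$ actually takes, i.e. whether an equation of the form $G_0(q) = \text{target}$ has a solution.

```python
import math


def poisson_cdf(k, mean):
    """G_0(k) = P[tau <= k] for tau ~ Poisson(mean), k a non-negative integer."""
    term = math.exp(-mean)  # P[tau = 0]
    total = term
    for j in range(1, k + 1):
        term *= mean / j    # P[tau = j] from P[tau = j - 1]
        total += term
    return total


def cdf_values(mean, k_max):
    """The values G_0(0), G_0(1), ..., G_0(k_max) that the null cdf can take."""
    return [poisson_cdf(k, mean) for k in range(k_max + 1)]


def is_attained(target, mean, k_max, tol=1e-12):
    """Check whether G_0(k) = target for some integer 0 <= k <= k_max."""
    return any(abs(v - target) < tol for v in cdf_values(mean, k_max))


if __name__ == "__main__":
    # Hypothetical null mean n * mu_0 = 5.
    for k, v in enumerate(cdf_values(5.0, 12)):
        print(k, round(v, 4))
    print("0.95 attained:", is_attained(0.95, 5.0, 50))
```

With $n\mu_0 = 5$ the distribution function jumps over the value 0.95, so no exact quantile equation can be solved for that target; only the listed jump values are available.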
Exercise 45 Let $X_1,\dots,X_n \overset{iid}{\sim} N(\mu, \sigma^2)$ with $\sigma^2 > 0$ known. Find the most
powerful test for the pair $H_0 : \mu = \mu_0$ vs $H_1 : \mu = \mu_1$ with $\mu_0 < \mu_1$ at significance
level $\alpha \in (0,1)$.


Exercise 46 Given a sample $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Bernoulli}(p)$, we wish to test

$$H_0 : p = 0.49 \qquad \text{vs} \qquad H_1 : p = 0.51.$$

Determine the (approximate) sample size for which both the probability of type I
error and the probability of type II error are equal to 0.01. Use a test function that
rejects $H_0$ when $\sum_{i} X_i$ is large. Hint: use the central limit theorem, and recall the
last part of Exercise 44 (p. 44). You will need to use the fact that $z_{0.99} \approx 2.33$, where
$z_{0.99}$ is the 0.99-quantile of the $N(0,1)$ distribution.

Exercise 47 Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Unif}(0,\theta)$ and consider the pair $H_0 : \theta = \theta_0$ and
$H_1 : \theta = \theta_1$ with $\theta_1 < \theta_0$.

1. Find the most powerful test of $H_0$ against $H_1$ at significance level $\alpha = (\theta_1/\theta_0)^n$.
Consider the behaviour of this level as a function of $\theta_0$, $\theta_1$ and $n$. What is the
power of this test? Is it possible to define a Neyman-Pearson optimal test for
other values of $\alpha$?

2. Consider a (not necessarily optimal) test at significance level $\alpha < (\theta_1/\theta_0)^n$ that
rejects $H_0$ when $X_{(n)} \le k$. Find the appropriate value of $k$. What is the power of
this test?


Exercise 48 (Intuitive Hypothesis Tests) The goal of this exercise is to motivate 
hypothesis testing from a more intuitive perspective, via point estimation. Let 
$X_1,\dots,X_n$ be iid with density function

$$f_\lambda(x) = \frac{1}{\lambda} e^{-x/\lambda}, \qquad x > 0,$$

where $\lambda > 0$ is a parameter. We wish to test $H_0 : \lambda = \lambda_0$ vs. $H_1 : \lambda = \lambda_1$, where
$\lambda_0 > \lambda_1$.

1. Find the maximum likelihood estimator $\hat{\lambda}_n$.

2. As we have seen in Chapter 3, $\hat{\lambda}_n$ is generally a good estimator. Consequently, a
natural approach is to reject $H_0$ if $\lambda_0$ is not "compatible" with $\hat{\lambda}_n$. In our case this
would translate to: reject $H_0$ if $\hat{\lambda}_n$ is small. (If it were the case that $\hat{\lambda}_n > \lambda_0$, we
would certainly choose $H_0$ and not $H_1$.) What is the form of such a test function
(up to a constant, say $D$)?

3. Now let us find the precise test function. To this aim, we must determine a critical
lower bound for $\hat{\lambda}_n$ sufficiently small to reject $H_0$. For a given level $\alpha \in (0,1)$,
we wish to choose the lower bound so that the probability of type I error is $\alpha$.
Describe the relationship between the constant $D$ and the level $\alpha$.

4. We can now wonder whether this is the best test possible. Could we have done
better, i.e. find a test at level $\alpha$ but more powerful yet? Show that the answer to
this question is in the negative, by proving that our test function is precisely the
same as that given by the Neyman-Pearson lemma (you may assume that the
value $Q$ in the lemma exists).

5. Find the simplest formula possible for the test function $\delta(X_1,\dots,X_n)$. Hint: $\hat{\lambda}_n$
involves a sum, and we know the distribution of each summand.
4.3.2 Unilateral Case


In the case of a unilateral null hypothesis vs a unilateral alternative, there is no result 
similar to the Neyman—Pearson lemma that describes an optimal test regardless of 
the specific type of probability model. However, we can still find broad classes 
of models for which optimal tests can be found. We will not consider the general 
specifications of such models here, but we note that models that are one-parameter 
exponential families do satisfy these conditions. Here is the form of the optimal 
unilateral test in one-parameter exponential families. 


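Theorem 4.16 (UMP Unilateral Tests for Exponential Families) Let
$X_1,\dots,X_n$ be an iid sample from a one-parameter exponential family with
density (or frequency)

$$f(x;\theta) = \exp\{\eta(\theta)T(x) - d(\theta) + S(x)\}, \qquad x \in \mathcal{X},\ \theta \in \Theta \subseteq \mathbb{R},$$

where $\Theta$ is an open subset, and $\eta(\cdot)$ is strictly increasing and continuously
differentiable. If $\tau = \sum_{i=1}^{n} T(X_i)$ is a continuous random variable, then:

1. For $\alpha \in (0,1)$, the test function $\delta = \mathbf{1}\{\tau \ge q_{1-\alpha}\}$ is uniformly most powerful
for testing

$$H_0 : \theta \le \theta_0 \qquad \text{vs} \qquad H_1 : \theta > \theta_0$$

at level $\alpha$. Here, $q_{1-\alpha}$ is the $(1-\alpha)$-quantile of $G_0(t) = P_{\theta_0}[\tau \le t]$.

2. For $\alpha \in (0,1)$, the test function $\delta = \mathbf{1}\{\tau \le q_{\alpha}\}$ is uniformly most powerful for
testing

$$H_0 : \theta \ge \theta_0 \qquad \text{vs} \qquad H_1 : \theta < \theta_0$$

at level $\alpha$. Here, $q_{\alpha}$ is the $\alpha$-quantile of $G_0(t) = P_{\theta_0}[\tau \le t]$.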
Remark 4.17 If $\eta(\cdot)$ is strictly decreasing, then define $\eta_1(\cdot) = -\eta(\cdot)$ and $T_1 = -T$.
Then we get an exponential family

$$f(x;\theta) = \exp\{\eta_1(\theta)T_1(x) - d(\theta) + S(x)\}, \qquad x \in \mathcal{X},\ \theta \in \Theta \subseteq \mathbb{R},$$

with $\eta_1(\cdot)$ strictly increasing. The theorem now applies in the same way, using
$\tau_1 = \sum_{i=1}^{n} T_1(X_i)$ in lieu of $\tau$. We can summarise the form of the test function,
as dependent on the direction of the hypotheses and on whether $\eta$ is increasing or
decreasing in the following table:

                            $H_0 : \theta \le \theta_0$                                $H_0 : \theta \ge \theta_0$
                            $H_1 : \theta > \theta_0$                                  $H_1 : \theta < \theta_0$
$\eta(\cdot)$ increasing    $\mathbf{1}\{\tau(X_1,\dots,X_n) \ge q_{1-\alpha}\}$    $\mathbf{1}\{\tau(X_1,\dots,X_n) \le q_{\alpha}\}$
$\eta(\cdot)$ decreasing    $\mathbf{1}\{\tau(X_1,\dots,X_n) \le q_{\alpha}\}$      $\mathbf{1}\{\tau(X_1,\dots,X_n) \ge q_{1-\alpha}\}$


Remark 4.18 Notice that, surprisingly, the form of the test is exactly the same
as the form of the test in an exponential family for a "simple vs simple" hypothesis
pair (compare the table above with the table in Example 4.14, p. 106). How is this
possible? The key observation is that, as we saw in Example 4.14, the form of the
Neyman-Pearson test function did not depend on the precise value of $\theta_1$, but only
on whether $\theta_1 < \theta_0$ or $\theta_1 > \theta_0$. It also depended on $\theta_0$. This explains why
the form of the test function in the unilateral case is the same as in the simple vs
simple case. This is not true in general, but is true for one-parameter exponential
families due to their special form.


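Proof of Theorem 4.16 We will prove part (1), since part (2) follows analogously.
To prove (1), we need to verify two things:

(I) That $\sup_{\theta \in (-\infty, \theta_0]} P_\theta[\delta = 1] \le \alpha$ (i.e. that $\delta$ maintains level $\alpha$ over the entire
null parameter space). Note that since $\delta$ is a Bernoulli random variable, $P_\theta[\delta = 1] = E_\theta[\delta]$.

(II) That for any $\psi : \mathcal{X}^n \to \{0,1\}$ such that $\sup_{\theta \in (-\infty, \theta_0]} P_\theta[\psi = 1] \le \alpha$, it must
be that

$$E_\theta[\psi] \le E_\theta[\delta], \qquad \forall\, \theta \in (\theta_0, \infty)$$

(i.e. that $\delta$ has maximal power over the entire alternative parameter space).

The key to showing (I) is to show that $\theta \mapsto E_\theta[\delta(X_1,\dots,X_n)] = P_\theta[\delta = 1]$ is
non-decreasing, by showing that its derivative is non-negative. Since $\eta(\cdot)$ and $d(\cdot)$ are
differentiable, $f(x;\theta)$ is of the exponential family form and $\delta : \mathcal{X}^n \to \{0,1\}$, we
may differentiate under the integral (see Remark 3.11, p. 67),

$$\begin{aligned}
\frac{\partial}{\partial\theta} E_\theta[\delta]
&= \frac{\partial}{\partial\theta} \int_{\mathcal{X}^n} \delta(x_1,\dots,x_n) \prod_{i=1}^{n} f(x_i;\theta)\, dx_1 \dots dx_n \\
&= \int_{\mathcal{X}^n} \delta(x_1,\dots,x_n)\, \frac{\partial}{\partial\theta} \prod_{i=1}^{n} f(x_i;\theta)\, dx_1 \dots dx_n \\
&= \int_{\mathcal{X}^n} \delta(x_1,\dots,x_n) \left( \prod_{i=1}^{n} f(x_i;\theta) \right)^{-1} \left( \frac{\partial}{\partial\theta} \prod_{i=1}^{n} f(x_i;\theta) \right) \prod_{i=1}^{n} f(x_i;\theta)\, dx_1 \dots dx_n \\
&= \int_{\mathcal{X}^n} \delta(x_1,\dots,x_n) \left( \frac{\partial}{\partial\theta} \log \prod_{i=1}^{n} f(x_i;\theta) \right) \prod_{i=1}^{n} f(x_i;\theta)\, dx_1 \dots dx_n \\
&= E_\theta\left[ \delta(X_1,\dots,X_n) \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(X_i;\theta) \right] \\
&= \mathrm{Cov}_\theta\left[ \delta(X_1,\dots,X_n),\ \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(X_i;\theta) \right] \\
&= \mathrm{Cov}_\theta\left[ \delta(X_1,\dots,X_n),\ \eta'(\theta)\tau(X_1,\dots,X_n) - n d'(\theta) \right] \\
&= \eta'(\theta)\, \mathrm{Cov}_\theta[\delta, \tau].
\end{aligned}$$

The third to last equality comes from the fact that when we can differentiate under
the integral sign,$^1$

$$E_\theta\left[ \sum_{i=1}^{n} \frac{\partial}{\partial\theta} \log f(X_i;\theta) \right] = 0.$$

$^1$ To verify this, replace $\delta$ by 1 in the array of equations above.

In the discrete case, we simply replace integration by summation, of course. With
this result under our belts, we can now verify (I). Notice first that $P_{\theta_0}[\delta = 1] =
P_{\theta_0}[\tau \ge q_{1-\alpha}] = 1 - P_{\theta_0}[\tau < q_{1-\alpha}]$. But $1 - P_{\theta_0}[\tau < q_{1-\alpha}] = 1 - G_0(q_{1-\alpha}) =
1 - (1 - \alpha) = \alpha$, since $\tau$ is continuous and $q_{1-\alpha}$ is the $(1-\alpha)$-quantile of $G_0$. Further to this, we
have calculated that $\frac{\partial}{\partial\theta} P_\theta[\delta = 1] = \frac{\partial}{\partial\theta} E_\theta[\delta] = \eta'(\theta)\,\mathrm{Cov}_\theta[\delta, \tau]$. But $\eta'(\theta)$ is
non-negative since $\eta(\cdot)$ is increasing, and $\mathrm{Cov}_\theta[\delta, \tau] \ge 0$ because $\delta = \mathbf{1}\{\tau \ge q_{1-\alpha}\}$ is an
increasing function of $\tau$, and is thus positively correlated with $\tau$ (see Lemma A.5,
p. 160). It follows that $\frac{\partial}{\partial\theta} P_\theta[\delta = 1] \ge 0$, and so $P_\theta[\delta = 1]$ is non-decreasing in $\theta$. It must
thus be that $P_\theta[\delta = 1] \le P_{\theta_0}[\delta = 1] = \alpha$ for all $\theta \le \theta_0$, and the proof of part (I) is
complete.

To prove part (II), let $\theta_1$ be an arbitrary element in $(\theta_0, \infty)$. Notice that

$$\begin{aligned}
\Lambda := \frac{f(X_1,\dots,X_n;\theta_1)}{f(X_1,\dots,X_n;\theta_0)}
&= \exp\{\eta(\theta_1)\tau - n d(\theta_1) - \eta(\theta_0)\tau + n d(\theta_0)\} \\
&= \exp\{[\eta(\theta_1) - \eta(\theta_0)]\tau - n d(\theta_1) + n d(\theta_0)\}.
\end{aligned}$$

It follows that the likelihood ratio is a strictly monotone function of $\tau$, since $\eta(\cdot)$ is
strictly increasing. Therefore, $\delta$ is equal to the likelihood ratio test function

$$\mathbf{1}\{\Lambda \ge Q\}, \qquad Q := \exp\{[\eta(\theta_1) - \eta(\theta_0)]\, q_{1-\alpha} - n d(\theta_1) + n d(\theta_0)\},$$

since $\delta$ is 1 if and only if $\mathbf{1}\{\Lambda \ge Q\}$ is 1. It follows from the Neyman-Pearson
lemma (Lemma 4.11, p. 103) that

$$E_{\theta_1}[\psi] \le E_{\theta_1}[\delta], \qquad \forall\, \theta_1 \in (\theta_0, \infty),$$

for any $\psi : \mathcal{X}^n \to \{0,1\}$ such that $P_{\theta_0}[\psi = 1] \le \alpha$. Note, however, that

$$\sup_{\theta \le \theta_0} P_\theta[\psi = 1] \le \alpha \implies P_{\theta_0}[\psi = 1] \le \alpha.$$

And so, from what we have just proven, $\sup_{\theta \le \theta_0} P_\theta[\psi = 1] \le \alpha$ must thus imply

$$E_{\theta_1}[\psi] \le E_{\theta_1}[\delta], \qquad \forall\, \theta_1 \in (\theta_0, \infty).$$

This proves (II) and thus the proof is complete. $\square$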
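The monotonicity established in step (I) can be observed directly in a model where the power function has a closed form. The sketch below assumes, for illustration only, an iid $N(\mu, \sigma^2)$ sample with known $\sigma$ (a one-parameter exponential family with $\tau = \sum_i X_i$ and increasing $\eta$) and evaluates the power of the test that rejects $H_0 : \mu \le \mu_0$ for large $\tau$. The function name and the numerical values used in the usage example are our own.

```python
from statistics import NormalDist


def power(mu, mu0, sigma, n, alpha):
    """P_mu[reject H0] for the test rejecting when sum(X_i) >= q_{1-alpha}.

    Under mean mu, sum(X_i) ~ N(n * mu, n * sigma^2); q_{1-alpha} is the
    (1 - alpha)-quantile of that distribution evaluated at mu = mu0.
    """
    sd_sum = sigma * n ** 0.5
    q = NormalDist(n * mu0, sd_sum).inv_cdf(1.0 - alpha)
    return 1.0 - NormalDist(n * mu, sd_sum).cdf(q)


if __name__ == "__main__":
    # Hypothetical setting: mu0 = 0, sigma = 1, n = 25, alpha = 0.05.
    for mu in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(f"mu = {mu:+.1f}: P[reject] = {power(mu, 0.0, 1.0, 25, 0.05):.6f}")
```

The printed rejection probabilities increase with `mu`, stay below `alpha` on the null side, and equal `alpha` exactly at the boundary `mu0`, so the supremum over the null parameter space is attained at $\theta_0$.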


Exercise 49 A bio-imaging laboratory has developed a new method to carry out 
brain scans in less than 20 min. A sample of 12 brain scan durations from the lab is 
given below: 


X = {21, 18,19, 16, 18,24, 22, 19, 24, 26, 18, 21}. 


1. Suppose that the duration time approximately follows an $N(\mu, 3^2)$ distribution.
Test whether the mean scan time is less than 20 min, i.e., test $H_0 : \mu \le \mu_0$ vs
$H_1 : \mu > \mu_0$ with $\mu_0 = 20$ at significance level $\alpha = 0.05$.

2. Could you carry out the same analysis if the variance were unknown?


Hint: use $\delta = \mathbf{1}\left\{ \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \ge t_{n-1,1-\alpha} \right\}$ as your test function. Here $t_{n-1,1-\alpha}$ is the
$1-\alpha$ quantile of the Student t distribution with $n - 1$ degrees of freedom.


Exercise 50 Let $Y_1,\dots,Y_4$ be iid $N(\mu, 4^2)$ random variables. We wish to determine
whether $\mu$ is larger than $\mu_0 = 10$. To this aim, we carry out a test at level
$\alpha = 5\%$ contrasting the hypotheses $H_0 : \mu \le 10$ and $H_1 : \mu > 10$.

1. Calculate the power of the test when the true value of $\mu$ equals 13 and when it
equals 11.

2. Determine what number of observations we need to have to guarantee that the
power of the test is at least 90% when the true mean is $\mu = 13$.


Exercise 51 (Paired Test) A standard problem in the pharmaceutical industry is 
to determine whether treatment with a new drug will have an effect on a patient. 
Consider, for instance, the problem of reducing blood pressure, perhaps even by 
placebo effect. Let $X_i$ be the blood pressure of the $i$th patient before the drug
treatment, and $Y_i$ the $i$th patient's blood pressure at the end of the treatment. We
may suppose that the $X_i$ are iid, since different patients are chosen at random.
Similarly, the $Y_i$ are independent, since all patients received the same treatment.
Assume that $X_i \sim N(\mu_1, \sigma_1^2)$ and $Y_i \sim N(\mu_2, \sigma_2^2)$, with unknown $\sigma_1^2, \sigma_2^2$. Construct
a test in order to test the hypothesis that the drug treatment lowers the blood pressure.
Remark: since $X_i$ and $Y_i$ come from the same person (patient $i$), we cannot assume
them to be independent. In this context, we speak of a paired test.


4.3.2.1 Approximate Critical Values 

Notice that, in order to be able to implement the unilateral test in practice, we
will need to know how to calculate the quantile $q_\alpha$ in the table above. This can
be calculated provided that $G_0(t) = P_{\theta_0}[\tau(X_1,\dots,X_n) \le t]$ is known exactly. In
the examples we considered (e.g. Example 4.13) this was indeed the case, but it
will not always be the case: as we saw in Sect. 2.4 (p. 53) it is often not possible to
determine the precise distribution $G_0(t)$. However, one can approximate it for large
values of the sample size $n$. Specifically, Corollary 2.24 (p. 56) tells us that

$$\sqrt{n}\left( n^{-1}\tau(X_1,\dots,X_n) - \gamma'(\phi) \right) \xrightarrow{d} N(0, \gamma''(\phi)),$$

or, equivalently, by Exercise 23 (p. 52)

$$\sqrt{n}\left( \frac{\tau(X_1,\dots,X_n)}{n} - \frac{d'(\theta)}{\eta'(\theta)} \right) \xrightarrow{d} N\left( 0,\ \frac{d''(\theta)\eta'(\theta) - d'(\theta)\eta''(\theta)}{[\eta'(\theta)]^3} \right).$$

The latter suggests approximating the distribution $G_0(t) = P_{\theta_0}[\tau(X_1,\dots,X_n) \le t]$
by a

$$N\left( n\,\frac{d'(\theta_0)}{\eta'(\theta_0)},\ n\,\frac{d''(\theta_0)\eta'(\theta_0) - d'(\theta_0)\eta''(\theta_0)}{[\eta'(\theta_0)]^3} \right)$$

distribution, when $n$ is sufficiently large. Notice that since this latter distribution is
a continuous distribution, it follows that for large enough samples from exponential
families, we are able to approximately construct the Neyman-Pearson optimal test
for any level $\alpha$. This can be done by the method of standardisation (Lemma 1.32, p.
22), thus employing tables of quantiles for the $N(0,1)$ distribution.
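As an illustration of how such an approximate critical value might be computed, the sketch below uses the standard exponential family moments $E_\theta[T] = d'(\theta)/\eta'(\theta)$ and $\mathrm{Var}_\theta[T] = [d''(\theta)\eta'(\theta) - d'(\theta)\eta''(\theta)]/[\eta'(\theta)]^3$ to build the approximating normal distribution of $\tau$ under the null. It is then specialised to the exponential model of Example 4.13, written with $\eta(\lambda) = -\lambda$, $T(x) = x$ and $d(\lambda) = -\log\lambda$. The function names are hypothetical, and the derivatives are supplied by hand rather than computed symbolically.

```python
from statistics import NormalDist


def approx_null_distribution(n, d1, d2, eta1, eta2):
    """Normal approximation to the null distribution of tau = sum T(X_i).

    d1, d2: d'(theta0) and d''(theta0);
    eta1, eta2: eta'(theta0) and eta''(theta0).
    """
    mean = n * d1 / eta1
    variance = n * (d2 * eta1 - d1 * eta2) / eta1 ** 3
    return NormalDist(mean, variance ** 0.5)


def approx_quantile_exponential(p, n, rate0):
    """Approximate p-quantile of sum(X_i) for X_i iid Exp(rate0).

    Uses eta(l) = -l, T(x) = x, d(l) = -log(l), so that
    d'(l) = -1/l, d''(l) = 1/l^2, eta'(l) = -1, eta''(l) = 0.
    """
    dist = approx_null_distribution(
        n, d1=-1.0 / rate0, d2=1.0 / rate0 ** 2, eta1=-1.0, eta2=0.0
    )
    return dist.inv_cdf(p)


if __name__ == "__main__":
    # Hypothetical values: n = 100 observations, lambda_0 = 2, alpha = 0.05.
    print(approx_quantile_exponential(0.05, 100, 2.0))
```

In this model the approximating distribution has mean $n/\lambda_0$ and variance $n/\lambda_0^2$, which agree with the exact mean and variance of the gamma($n$, $\lambda_0$) null distribution; only the shape is approximated.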


4.3.3 Bilateral Case 


Unfortunately, for hypothesis pairs of the form $H_0 : \theta = \theta_0$ and $H_1 : \theta \neq \theta_0$, there
can be no optimal test, in the sense described in Definition 4.10 (p. 103). To see
this, note that for $\delta : \mathcal{X}^n \to \{0,1\}$ to be uniformly most powerful for $H_0 : \theta = \theta_0$
vs $H_1 : \theta \neq \theta_0$, it must be most powerful for $H_0 : \theta = \theta_0$ and $H_1 : \theta = \theta_1$,
for all $\theta_1 \neq \theta_0$. But consider the problem of testing such a pair, in a one-parameter
exponential family $f(x;\theta) = \exp\{\eta(\theta)T(x) - d(\theta) + S(x)\}$. Example 4.14 (p. 106)
tells us that the form of the test is different depending on whether $\theta_1 > \theta_0$ or $\theta_1 < \theta_0$,
so there can be no optimal test: if a test is most powerful over $(\theta_0, \infty)$, then it will
necessarily be less powerful than some other test over $(-\infty, \theta_0)$.
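A small numerical comparison makes the point tangible. Assuming, purely for illustration, an iid $N(\mu, 1)$ sample and $H_0 : \mu = 0$, the sketch below computes the power of three level-$\alpha$ tests based on $\sqrt{n}\,\bar{X}_n$: the right-sided test, the left-sided test, and a symmetric two-sided test. The function names are ours.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_right(mu, n, alpha):
    """Power of the test rejecting when sqrt(n) * mean >= z_{1-alpha}."""
    return 1.0 - Z.cdf(Z.inv_cdf(1.0 - alpha) - n ** 0.5 * mu)


def power_left(mu, n, alpha):
    """Power of the test rejecting when sqrt(n) * mean <= z_alpha."""
    return Z.cdf(Z.inv_cdf(alpha) - n ** 0.5 * mu)


def power_two_sided(mu, n, alpha):
    """Power of the test rejecting when |sqrt(n) * mean| >= z_{1-alpha/2}."""
    z = Z.inv_cdf(1.0 - alpha / 2.0)
    shift = n ** 0.5 * mu
    return (1.0 - Z.cdf(z - shift)) + Z.cdf(-z - shift)


if __name__ == "__main__":
    for mu in (-0.5, 0.5):
        print(
            f"mu = {mu:+.1f}:",
            f"right = {power_right(mu, 16, 0.05):.4f},",
            f"left = {power_left(mu, 16, 0.05):.4f},",
            f"two-sided = {power_two_sided(mu, 16, 0.05):.4f}",
        )
```

Each one-sided test beats the two-sided test on its own side of the null and is nearly powerless on the other side, so no single one of the three dominates over the whole bilateral alternative.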

Because of this, we need to abandon the hope of uniquely determining the best 
testing method, as we were able to do in the previous two paragraphs. Instead, 
we must look for testing methods that will yield reasonably well-performing tests 
in general. We will consider two such methods, both motivated by the notion of 
likelihood: The Likelihood Ratio Method and Wald’s Method. 


4.3.3.1 Likelihood Ratio Tests 

In the previous chapter we saw that the concept of likelihood is of fundamental 
importance in the problem of point estimation. In particular, we saw that we can 
construct estimators with excellent properties if we use the method of maximum 
likelihood: choosing as our estimator the element of the parameter space that 
maximises the likelihood. 

The motivation behind the likelihood ratio test is to use the concept of likelihood 
again, but this time in order to decide between the two competing hypotheses. The 
hope is that such an approach will yield powerful tests. The formal definition is as 
follows. 


Definition 4.19 (Likelihood Ratio Test) 
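Let $X_1,\dots,X_n \overset{iid}{\sim} f(x;\theta)$, yielding a likelihood

$$L(\theta) = \prod_{i=1}^{n} f(X_i;\theta),$$

and let $H_0 : \theta \in \Theta_0$ and $H_1 : \theta \in \Theta_1$ be two competing hypotheses. Define the
likelihood ratio as

$$\Lambda(X_1,\dots,X_n) = \frac{\sup_{\theta \in \Theta_1} L(\theta)}{\sup_{\theta \in \Theta_0} L(\theta)}.$$

The Likelihood Ratio Test (LRT) at level $\alpha \in (0,1)$ is defined to be the test with
test function

$$\delta(X_1,\dots,X_n) = \mathbf{1}\{\Lambda(X_1,\dots,X_n) \ge Q\},$$

where $Q > 0$ is such that $\sup_{\theta \in \Theta_0} P_\theta[\Lambda(X_1,\dots,X_n) \ge Q] = \alpha$, provided such
a $Q$ exists.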


What is the intuition behind the LRT? When we had a simple vs simple
hypothesis pair, the Neyman-Pearson Lemma (Lemma 4.11, p. 103) said that we
should compare the likelihood evaluated at the alternative value $\theta_1$ to the likelihood
evaluated at the null value $\theta_0$. When either of these sets may not be a singleton, the
LRT method suggests that we simply compare the maximum achievable likelihood
from within $\Theta_1$ to the maximum achievable likelihood from within $\Theta_0$, thus
mimicking the Neyman-Pearson lemma.


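Remark 4.20 (LRT for Bilateral Hypothesis Pairs) Note that when $H_0 : \theta = \theta_0$
and $H_1 : \theta \neq \theta_0$ we have $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \mathbb{R} \setminus \{\theta_0\}$, so, if $L$ is a continuous
function of $\theta$ and attains its supremum,

$$\Lambda(X_1,\dots,X_n) = \frac{\sup_{\theta \in \Theta_1} L(\theta)}{\sup_{\theta \in \Theta_0} L(\theta)} = \frac{\sup_{\theta \in \mathbb{R} \setminus \{\theta_0\}} L(\theta)}{L(\theta_0)} = \frac{\sup_{\theta \in \mathbb{R}} L(\theta)}{L(\theta_0)} = \frac{L(\hat{\theta})}{L(\theta_0)},$$

where $\hat{\theta}$ is a maximum likelihood estimator of $\theta$.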


Example 4.21 
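Let $X_1,\dots,X_n \overset{iid}{\sim} N(\mu, \sigma^2)$. Assume that $\sigma^2$ is known and suppose we are interested in testing
the hypothesis pair

$$H_0 : \mu = \mu_0 \qquad \text{vs} \qquad H_1 : \mu \neq \mu_0.$$

Since the MLE of $\mu$ is $\bar{X}$, we have

$$L(\bar{X}) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X})^2 \right\},$$

$$L(\mu_0) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu_0)^2 \right\}.$$

Consequently,

$$\Lambda(X_1,\dots,X_n) = \frac{L(\bar{X})}{L(\mu_0)} = \exp\left\{ \frac{1}{2\sigma^2} \left[ \sum_{i=1}^{n} (X_i - \mu_0)^2 - \sum_{i=1}^{n} (X_i - \bar{X})^2 \right] \right\}.$$

But we note that

$$\sum_{i=1}^{n} (X_i - \mu_0)^2 = \sum_{i=1}^{n} (X_i - \bar{X} + \bar{X} - \mu_0)^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 + n(\bar{X} - \mu_0)^2,$$

because the cross-terms vanish. It follows that the likelihood ratio reduces to

$$\Lambda(X_1,\dots,X_n) = \exp\left\{ \frac{n}{2\sigma^2} (\bar{X} - \mu_0)^2 \right\}.$$

It follows that $\Lambda(X_1,\dots,X_n)$ is a monotone increasing function of $S(X_1,\dots,X_n) = \left( \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \right)^2$.

Note that when $H_0$ is true, $S \sim \chi^2_1$ (recall Example 1.29, p. 20). Therefore, the likelihood ratio
test rejects the null hypothesis if and only if $S(X_1,\dots,X_n) \ge \chi^2_{1,1-\alpha}$, where $\chi^2_{1,1-\alpha}$ denotes the
$1-\alpha$ quantile of the $\chi^2_1$ distribution. Notice that this is equivalent to rejecting the null if and only
if

$$\left| \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \right| \ge z_{1-\alpha/2},$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of an $N(0,1)$ distribution. $\square$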


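An important aspect of the Likelihood Ratio method is that it can handle
situations where there is more than one parameter, but we are interested in
testing a bilateral hypothesis for a single parameter. In other words, suppose that
$X_1,\dots,X_n \overset{iid}{\sim} f(x;\theta,\xi)$, where $\theta\in\mathbb{R}$ and $\xi\in\mathbb{R}^p$ are two unknown parameters.
We might be interested in testing

$$H_0 : \theta = \theta_0 \quad\text{vs}\quad H_1 : \theta \neq \theta_0$$

at level $\alpha > 0$, for some $\theta_0\in\mathbb{R}$, without making any reference to (and without
caring about) the remaining parameter $\xi$ (a parameter such as $\xi$ is often referred to
as a nuisance parameter). In this case, the likelihood ratio is formed as

$$\Lambda(X_1,\dots,X_n) = \frac{\sup_{\theta\in\mathbb{R}\setminus\{\theta_0\},\,\xi\in\mathbb{R}^p} L(\theta,\xi)}{\sup_{\theta\in\{\theta_0\},\,\xi\in\mathbb{R}^p} L(\theta,\xi)} = \frac{\sup_{\theta\in\mathbb{R},\,\xi\in\mathbb{R}^p} L(\theta,\xi)}{\sup_{\xi\in\mathbb{R}^p} L(\theta_0,\xi)} = \frac{L(\hat\theta,\hat\xi)}{\sup_{\xi\in\mathbb{R}^p} L(\theta_0,\xi)},$$

where $(\hat\theta,\hat\xi)$ is an MLE of $(\theta,\xi)$. The Likelihood Ratio Test at level $\alpha\in(0,1)$ will
be defined again as the test with test function

$$\delta(X_1,\dots,X_n) = 1\{\Lambda(X_1,\dots,X_n) > Q\},$$

where $Q > 0$ is such that $\sup_{\xi\in\mathbb{R}^p} P_{\theta_0,\xi}[\Lambda(X_1,\dots,X_n) > Q] = \alpha$, provided such a
$Q$ exists. Here is the classical example: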


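Example 4.22 (Bilateral Test for Means of Gaussian Distributions)

Let $X_1,\dots,X_n \overset{iid}{\sim} N(\mu,\sigma^2)$, where $\mu$ and $\sigma^2$ are unknown. Suppose we wish to test the hypothesis

$$H_0 : \mu = \mu_0 \quad\text{vs}\quad H_1 : \mu \neq \mu_0$$

at level $\alpha > 0$, for some fixed $\mu_0\in\mathbb{R}$. Let us use the Likelihood Ratio method in order to derive
a suitable test. We notice that we have two parameters, but are only interested in one of them.
Following the reasoning presented above, we need to determine

$$\Lambda(X_1,\dots,X_n) = \frac{L(\hat\mu,\hat\sigma^2)}{\sup_{\sigma^2>0} L(\mu_0,\sigma^2)}, \tag{4.2}$$

where $(\hat\mu,\hat\sigma^2)$ is the MLE of $(\mu,\sigma^2)$. For the denominator, one may calculate that

$$\ell(\mu_0,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu_0)^2.$$

Following the same steps as in Exercise 3.16 (p. 71), we conclude that

$$\arg\sup_{\sigma^2>0} L(\mu_0,\sigma^2) = \frac{1}{n}\sum_{i=1}^n (X_i-\mu_0)^2.$$

In other words, the supremum in the denominator in Eq. (4.2) satisfies

$$\sup_{\sigma^2>0} L(\mu_0,\sigma^2) = L\left(\mu_0, \frac{1}{n}\sum_{i=1}^n (X_i-\mu_0)^2\right),$$

and so the denominator is equal to

$$\sup_{\sigma^2>0} L(\mu_0,\sigma^2) = \left(\frac{1}{2\pi(1/n)\sum_{i=1}^n (X_i-\mu_0)^2}\right)^{n/2}\exp\left\{-\frac{\sum_{i=1}^n (X_i-\mu_0)^2}{2(1/n)\sum_{i=1}^n (X_i-\mu_0)^2}\right\} = \left(\frac{n e^{-1}}{2\pi\sum_{i=1}^n (X_i-\mu_0)^2}\right)^{n/2}.$$

Next, we turn to the numerator in Eq. (4.2). Recalling Example 3.16 (p. 71), we have that the
MLE of $(\mu,\sigma^2)$ is given by the pair:

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i = \bar X \quad\&\quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i-\bar X)^2.$$

It follows that

$$L(\hat\mu,\hat\sigma^2) = \left(\frac{1}{2\pi(1/n)\sum_{i=1}^n (X_i-\bar X)^2}\right)^{n/2}\exp\left\{-\frac{\sum_{i=1}^n (X_i-\bar X)^2}{2(1/n)\sum_{i=1}^n (X_i-\bar X)^2}\right\} = \left(\frac{n e^{-1}}{2\pi\sum_{i=1}^n (X_i-\bar X)^2}\right)^{n/2}.$$

Consequently, the likelihood ratio is

$$\Lambda(X_1,\dots,X_n) = \frac{L(\hat\mu,\hat\sigma^2)}{\sup_{\sigma^2>0} L(\mu_0,\sigma^2)} = \left(\frac{\sum_{i=1}^n (X_i-\mu_0)^2}{\sum_{i=1}^n (X_i-\bar X)^2}\right)^{n/2}.$$

This can be further simplified by recalling that

$$\sum_{i=1}^n (X_i-\mu_0)^2 = \sum_{i=1}^n (X_i-\bar X+\bar X-\mu_0)^2 = \sum_{i=1}^n (X_i-\bar X)^2 + n(\bar X-\mu_0)^2,$$

since the cross-terms vanish. Using this fact, we may write

$$\Lambda(X_1,\dots,X_n) = \left(\frac{\sum_{i=1}^n (X_i-\bar X)^2 + n(\bar X-\mu_0)^2}{\sum_{i=1}^n (X_i-\bar X)^2}\right)^{n/2} = \left(1 + \frac{n(\bar X-\mu_0)^2}{\sum_{i=1}^n (X_i-\bar X)^2}\right)^{n/2}.$$

Observe now that, writing $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)^2$ and $T = (\bar X-\mu_0)/(S/\sqrt{n})$, we have $n(\bar X-\mu_0)^2/\sum_{i=1}^n (X_i-\bar X)^2 = T^2/(n-1)$, so that

$$\Lambda > Q \iff T^2 = \frac{n(\bar X-\mu_0)^2}{S^2} > C := (n-1)\left(Q^{2/n}-1\right) \iff |T| > \sqrt{C},$$

so the likelihood ratio test is

$$\delta(X_1,\dots,X_n) = 1\{\Lambda > Q\} = 1\left\{\left|\frac{\bar X-\mu_0}{S/\sqrt{n}}\right| > \sqrt{C}\right\},$$

and $\sqrt{C}$ needs to be selected so that $P_{\mu_0}\left[|T| > \sqrt{C}\right] = \alpha$. But, when $H_0$ is true, we have
that $T \sim t_{n-1}$, the latter denoting Student's distribution with $n-1$ degrees of freedom (see
Theorem 2.9, p. 48). It follows that $\sqrt{C} = t_{n-1,1-\alpha/2}$, where the latter is the $(1-\alpha/2)$-quantile
of the $t_{n-1}$ distribution. In conclusion, the LRT is

$$\delta(X_1,\dots,X_n) = 1\left\{|\bar X-\mu_0| > t_{n-1,1-\alpha/2}\, S/\sqrt{n}\right\}.$$
□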


Notice the intuition in this result: we will reject the hypothesis $H_0 : \mu = \mu_0$ if
$\bar X$ (the MLE of $\mu$) is at a "significant" distance from $\mu_0$. How large is "significant"?
The answer is $t_{n-1,1-\alpha/2}$ times the (estimated) standard deviation of $\bar X$ (estimated
by $S/\sqrt{n}$). We will see in Sect. 4.3.3.3 that we can motivate another type of testing
method by generalising this idea. For the moment, though, we turn to consider
another important problem in the next section.
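As a numerical sanity check on the reduction carried out in Example 4.22, the following Python sketch computes the likelihood ratio in two ways: directly from the two maximised Gaussian likelihoods, and through the identity $\Lambda = (1 + T^2/(n-1))^{n/2}$. The function names and the simulated sample are illustrative assumptions, not part of the text.

```python
import math
import random


def log_lik(xs, mu, sigma2):
    """Gaussian log-likelihood of the iid sample xs at (mu, sigma2)."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)


def lrt_direct(xs, mu0):
    """Lambda = L(mu_hat, sigma2_hat) / sup over sigma2 of L(mu0, sigma2)."""
    n = len(xs)
    xbar = sum(xs) / n
    s2_hat = sum((x - xbar) ** 2 for x in xs) / n  # unrestricted MLE of sigma^2
    s2_null = sum((x - mu0) ** 2 for x in xs) / n  # MLE of sigma^2 when mu = mu0
    return math.exp(log_lik(xs, xbar, s2_hat) - log_lik(xs, mu0, s2_null))


def t_statistic(xs, mu0):
    """T = (xbar - mu0) / (S / sqrt(n)), with S^2 the unbiased sample variance."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return (xbar - mu0) / math.sqrt(s2 / n)


def lrt_from_t(xs, mu0):
    """Lambda written as a monotone function of T^2."""
    n = len(xs)
    t = t_statistic(xs, mu0)
    return (1 + t * t / (n - 1)) ** (n / 2)


rng = random.Random(0)  # fixed seed, for reproducibility only
sample = [rng.gauss(1.0, 2.0) for _ in range(25)]
print(lrt_direct(sample, 0.5), lrt_from_t(sample, 0.5))
```

The two printed values should agree up to floating-point error, which is exactly why the test can be expressed through $|T|$ alone.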


Exercise 52 (Bilateral Test for Variances of Gaussian Distributions) Let
$X_1,\dots,X_n$ be an iid random sample from a $N(\mu,\sigma^2)$ distribution, where both
$\mu$ and $\sigma^2$ are unknown. Show that the LRT for the hypothesis pair $H_0 : \sigma^2 = \sigma_0^2$
vs $H_1 : \sigma^2 \neq \sigma_0^2$ at level $\alpha$ is of the form $1\{W > c_1\} + 1\{W < c_2\}$, where
$W = (1/\sigma_0^2)\sum_{i=1}^n (X_i-\bar X)^2$, and $c_1$ and $c_2$ are such that $c_1^n e^{-c_1} = c_2^n e^{-c_2}$.
Hint: Write the likelihood ratio as a function of $W$ and investigate the form of
this function. Remark: in practice, one usually chooses $c_1$ and $c_2$ such that $P_{H_0}(W >
c_1) = P_{H_0}(W < c_2) = \alpha/2$ (which is no longer a likelihood ratio test).


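Exercise 53 (Unpaired Test) Let $X_1,\dots,X_n, Y_1,\dots,Y_m$ be a sample of $n+m$
independent random variables, where $X_i \overset{iid}{\sim} N(\mu_1,\sigma^2)$ and $Y_j \overset{iid}{\sim} N(\mu_2,\sigma^2)$, and
$\sigma^2$ is unknown (but the same for the $X$ and the $Y$). The goal of this exercise is to
determine the LRT for the hypothesis pair $H_0 : \mu_1 = \mu_2$ vs $H_1 : \mu_1 \neq \mu_2$.

1. Define the likelihood of the parameter $\theta = (\mu_1,\mu_2,\sigma^2)$.

2. Noting that $\Theta_0 = \{(\mu,\mu,\sigma^2) : -\infty < \mu < \infty,\ 0 < \sigma^2 < \infty\}$ and $\Theta_1 =
   \{(\mu_1,\mu_2,\sigma^2) : -\infty < \mu_1 \neq \mu_2 < \infty,\ 0 < \sigma^2 < \infty\}$, show that

   $$\sup_{\theta\in\Theta_0} L(\theta) = \left(\frac{e^{-1}}{2\pi\hat\sigma_0^2}\right)^{(n+m)/2},$$

   where $\hat\sigma_0^2 = \frac{1}{n+m}\left(\sum_{i=1}^n (X_i-\hat\mu)^2 + \sum_{j=1}^m (Y_j-\hat\mu)^2\right)$, with $\hat\mu =
   \frac{1}{n+m}\left(\sum_{i=1}^n X_i + \sum_{j=1}^m Y_j\right)$. Show further that

   $$\sup_{\theta\in\Theta_1} L(\theta) = \left(\frac{e^{-1}}{2\pi\hat\sigma_1^2}\right)^{(n+m)/2},$$

   where $\hat\sigma_1^2 = \frac{1}{n+m}\left(\sum_{i=1}^n (X_i-\bar X)^2 + \sum_{j=1}^m (Y_j-\bar Y)^2\right)$.

3. Using the fact that $\sum_{i=1}^n (X_i-\hat\mu)^2 = \sum_{i=1}^n (X_i-\bar X)^2 + \frac{nm^2}{(n+m)^2}(\bar X-\bar Y)^2$ and that
   $\sum_{j=1}^m (Y_j-\hat\mu)^2 = \sum_{j=1}^m (Y_j-\bar Y)^2 + \frac{mn^2}{(n+m)^2}(\bar X-\bar Y)^2$, show that

   $$\Lambda(X_1,\dots,X_n,Y_1,\dots,Y_m) = \left(1 + \frac{T^2}{m+n-2}\right)^{(n+m)/2},$$

   where

   $$T = \frac{\bar X-\bar Y}{\sqrt{\left(\frac{1}{n}+\frac{1}{m}\right)\frac{1}{n+m-2}\left[(n-1)S_1^2 + (m-1)S_2^2\right]}},$$

   with $S_1^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar X)^2$ and $S_2^2 = \frac{1}{m-1}\sum_{j=1}^m (Y_j-\bar Y)^2$.

4. Using the fact that the level $\alpha$ test with test function given by $1\{\Lambda(X_1,\dots,X_n,
   Y_1,\dots,Y_m) > Q\}$ is the same as the level $\alpha$ test with test function $1\{|T| > Q'\}$,
   where $Q'$ is such that $\sup_{\theta\in\Theta_0} P_\theta(|T| > Q') = \alpha$, determine the LRT, i.e. find
   the law of $T$ under $H_0$ as well as the value of $Q'$.

Hint: if $A \sim \chi^2_a$ and $B \sim \chi^2_b$ are independent, it follows that $A+B \sim \chi^2_{a+b}$.
Theorem 2.9 (p. 48) could also be useful.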


4.3.3.2 Approximate Critical Values for Likelihood Ratio Tests 

In Example 4.22 we were able to find the precise value of $Q$ needed in the LRT
test function $\delta = 1\{\Lambda > Q\}$, by reducing the test statistic to an equivalent expression,
and by using the properties of the normal distribution. This may not be the case
more generally, though, where we may not be able to find the exact distribution of
$\Lambda$ (or a monotone function of it) and thus derive the exact $Q$. In these cases, we will
need to resort to large sample approximations, as we have done in other cases where
exact sampling distributions were not available. We consider the problem of finding
the approximate distribution of $\Lambda$ under simple nulls for one-parameter exponential
families.

Theorem 4.23 Let $X_1,\dots,X_n$ be an iid sample from a distribution with density
(or mass function) $f(x;\theta)$ which belongs to a non-degenerate one-parameter
exponential family,

$$f(x;\theta) = \exp\{\eta(\theta)T(x) - d(\theta) + S(x)\}, \qquad x\in\mathcal{X},\ \theta\in\Theta.$$

Assume that:

1. The parameter space $\Theta\subseteq\mathbb{R}$ is an open set.

2. The function $\eta(\cdot)$ is a twice continuously differentiable bijection between $\Theta$
and $\Phi = \eta(\Theta)$.

Let $\hat\theta_n$ be the maximum likelihood estimator of $\theta$, and $\theta_0\in\Theta$ be some fixed
element of the parameter space such that $\eta'(\theta_0) \neq 0$. If $\Lambda(X_1,\dots,X_n) =
L(\hat\theta_n)/L(\theta_0)$ is the likelihood ratio, then

$$2\log\Lambda(X_1,\dots,X_n) = 2\left(\ell(\hat\theta_n) - \ell(\theta_0)\right) \overset{d}{\to} \chi^2_1,$$

whenever $\{H_0 : \theta = \theta_0\}$ is true.


Remark 4.24 (Likelihood Ratio vs LogLikelihood Difference) Notice that
knowing the distribution of $2\log\Lambda$ under the null hypothesis is equivalent to
knowing the distribution of $\Lambda$ under the null hypothesis, since the mapping $x \mapsto
2\log x$ is monotone. The result above can thus be used in order to determine the
right critical value for a likelihood ratio test. Specifically, the likelihood ratio test
function $1\{\Lambda > Q\}$ will be approximately (for $n$ being large) equivalent to the test
function

$$1\left\{2\log\Lambda > \chi^2_{1,1-\alpha}\right\} = 1\left\{\Lambda > \exp\left(\frac{\chi^2_{1,1-\alpha}}{2}\right)\right\},$$

where $\chi^2_{1,1-\alpha}$ denotes the $(1-\alpha)$-quantile of the $\chi^2_1$ distribution. In other words, for
large $n$, the approximate critical value should be $Q \approx \exp\left(\chi^2_{1,1-\alpha}/2\right)$.
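To make the recipe of Remark 4.24 concrete, here is a minimal Python sketch of the approximate critical value and of the resulting approximate LRT. The model used, an exponential distribution with rate $\lambda$ (for which $\eta(\lambda) = -\lambda$, $T(x) = x$, $d(\lambda) = -\log\lambda$), is our own choice of illustration and does not come from the text; the function names are hypothetical. The $\chi^2_1$ quantile is obtained from the normal quantile, using that the square of a standard normal variable is $\chi^2_1$.

```python
import math
from statistics import NormalDist


def chi2_1_quantile(p):
    """p-quantile of chi-square(1): P(Z^2 <= q) = p iff sqrt(q) = z_{(1+p)/2}."""
    return NormalDist().inv_cdf((1 + p) / 2) ** 2


def approx_lrt_critical_value(alpha):
    """Large-sample critical value Q ~ exp(chi2_{1,1-alpha} / 2) for Lambda."""
    return math.exp(chi2_1_quantile(1 - alpha) / 2)


def two_log_lambda_exponential(xs, rate0):
    """2 log Lambda for H0: rate = rate0 in the (assumed) Exp(rate) model.

    The log-likelihood is l(rate) = n log(rate) - rate * sum(xs) and the MLE
    is 1 / mean(xs), so 2(l(mle) - l(rate0)) = 2n(u - 1 - log u), u = rate0 * mean.
    """
    n = len(xs)
    u = rate0 * sum(xs) / n
    return 2 * n * (u - 1 - math.log(u))


def approx_lrt_rejects(xs, rate0, alpha):
    """Approximate level-alpha LRT: reject when 2 log Lambda > chi2_{1,1-alpha}."""
    return two_log_lambda_exponential(xs, rate0) > chi2_1_quantile(1 - alpha)
```

Working on the $2\log\Lambda$ scale rather than with $\Lambda$ itself avoids exponentiating large numbers; by monotonicity both forms give the same decision.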


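Proof of Theorem 4.23 We apply a second order Taylor expansion with Lagrange
form of remainder (Theorem A.1, p. 159) to obtain

$$2\left(\ell(\hat\theta_n) - \ell(\theta_0)\right) = 2\ell'(\hat\theta_n)(\hat\theta_n-\theta_0) - \ell''(\theta^*)(\hat\theta_n-\theta_0)^2 = \left[d''(\theta^*) - \eta''(\theta^*)\bar T\right]\left[\sqrt{n}(\hat\theta_n-\theta_0)\right]^2,$$

where $\theta^*$ is between $\theta_0$ and $\hat\theta_n$, $\bar T = \frac{1}{n}\sum_{i=1}^n T(X_i)$, and $\ell'(\hat\theta_n) = 0$ since $\hat\theta_n$ maximises the likelihood.
It follows that $|\theta^*-\theta_0| \le |\hat\theta_n-\theta_0|$, and thus $\theta^* \overset{p}{\to} \theta_0$ by consistency of $\hat\theta_n$.
We now consider the behaviour of the terms involved in the Taylor expansion as
$n\to\infty$. The continuous mapping theorem (Theorem 2.25, p. 57) implies that
$d''(\theta^*) \overset{p}{\to} d''(\theta_0)$ and $\eta''(\theta^*) \overset{p}{\to} \eta''(\theta_0)$ (since $d''$ and $\eta''$ are continuous at $\theta_0$;
see Remark 2.15, p. 53). Furthermore, $\bar T \overset{p}{\to} d'(\theta_0)/\eta'(\theta_0)$ by the law of
large numbers and Exercise 23 (p. 53). Finally, by asymptotic normality of the MLE
(see Corollary 3.27, p. 83), we know that

$$\sqrt{n}(\hat\theta_n-\theta_0) \overset{d}{\to} N\left(0, \frac{\eta'(\theta_0)}{d''(\theta_0)\eta'(\theta_0) - d'(\theta_0)\eta''(\theta_0)}\right) \overset{d}{=} \sqrt{\frac{\eta'(\theta_0)}{d''(\theta_0)\eta'(\theta_0) - d'(\theta_0)\eta''(\theta_0)}}\; Z,$$

for some $Z \sim N(0,1)$. Combining all of the above with Slutsky's theorem
(Theorem 2.26, p. 57) gives

$$\left[d''(\theta^*) - \eta''(\theta^*)\bar T\right]\left[\sqrt{n}(\hat\theta_n-\theta_0)\right]^2 \overset{d}{\to} \frac{d''(\theta_0)\eta'(\theta_0) - \eta''(\theta_0)d'(\theta_0)}{\eta'(\theta_0)}\cdot\frac{\eta'(\theta_0)}{d''(\theta_0)\eta'(\theta_0) - d'(\theta_0)\eta''(\theta_0)}\; Z^2 = Z^2.$$

In other words, $2(\ell(\hat\theta_n) - \ell(\theta_0)) \overset{d}{\to} \chi^2_1$, since $Z^2 \sim \chi^2_1$, being the square of a
standard normal random variable (see Eq. (1.4), in Example 1.29, p. 20). □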


Exercise 54 Let $X_1,\dots,X_n$ be an iid sample from a Poisson distribution with
parameter $\theta$. We wish to test $H_0 : \theta = \theta_0$ vs $H_1 : \theta \neq \theta_0$. Find an approximate
likelihood ratio test for this pair of hypotheses.


4.3.3.3 Wald Tests 

Another idea for building tests for bilateral hypotheses $\{H_0 : \theta = \theta_0,\ H_1 : \theta \neq \theta_0\}$
is to directly use the technology that we've developed for point estimation in order to
construct a test function. Suppose that we have an estimator $\hat\theta$ of $\theta$. Then, we could
compare the null value $\theta_0$ to the observed value of the estimator $\hat\theta(X_1,\dots,X_n)$. If
these are separated by a "significant" distance, then it is clear that we should reject
$H_0 : \theta = \theta_0$ in favour of $H_1 : \theta \neq \theta_0$. Clearly this distance cannot be expressed in
absolute terms, as it needs to take into account the variability of $\hat\theta$; so one idea is to
express this distance in terms of the variance of $\hat\theta$. This leads to a test statistic of the
form:

$$T = \frac{(\hat\theta-\theta_0)^2}{\mathrm{Var}(\hat\theta)},$$

and then the test function will be $\delta(X_1,\dots,X_n) = 1\{T > Q\}$. The critical value
$Q$ will of course be chosen in order to ensure that the level of the test is $\alpha$, in
other words we ask that $P_{\theta_0}[T > Q] = \alpha$. The problem is that $\mathrm{Var}(\hat\theta)$ is typically
unknown, and so an estimator $\widehat{\mathrm{Var}}(\hat\theta)$ must be used instead. Using such an estimator,
we obtain what is called a Wald test.


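Definition 4.25 (Wald Test)

Let $X_1,\dots,X_n \overset{iid}{\sim} f(x;\theta)$ and $\hat\theta$ be an estimator of $\theta$ based on the sample
$X_1,\dots,X_n$. A Wald test for the bilateral hypothesis pair $\{H_0 : \theta = \theta_0,\ H_1 :
\theta \neq \theta_0\}$ at level $\alpha$ is a test with test function

$$\delta(X_1,\dots,X_n) = 1\left\{\frac{(\hat\theta-\theta_0)^2}{\widehat{\mathrm{Var}}(\hat\theta)} > Q\right\},$$

where $P_{\theta_0}\left[\frac{(\hat\theta-\theta_0)^2}{\widehat{\mathrm{Var}}(\hat\theta)} > Q\right] = \alpha$, provided such a $Q$ exists.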


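If $\hat\theta$ is taken to be a maximum likelihood estimator of $\theta$, then we have seen
(Remark 3.29, p. 85, and Exercise 36, p. 85) that the asymptotic variance equals

$$\frac{1}{n}\,\frac{\eta'(\theta_0)}{d''(\theta_0)\eta'(\theta_0) - d'(\theta_0)\eta''(\theta_0)} = \frac{1}{nI(\theta_0)} = \frac{1}{nJ(\theta_0)}.$$

Therefore, we could use

$$\hat J_n = nJ(\hat\theta_n) = n\,\frac{d''(\hat\theta_n)\eta'(\hat\theta_n) - d'(\hat\theta_n)\eta''(\hat\theta_n)}{\eta'(\hat\theta_n)}$$

instead of $\mathrm{Var}^{-1}(\hat\theta)$. When we use $\hat\theta_n$ as the estimator and $\hat J_n$ instead of $\mathrm{Var}^{-1}(\hat\theta)$ in
a test of this type, then we get the so-called likelihood-based Wald test.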


4.3.3.4 Approximate Critical Values for Likelihood-Based Wald Tests 

As was the case with likelihood ratio tests, we will rarely be able to find the
critical value $Q$ exactly. Instead, we will need an asymptotic approximation with
respect to $n$. For a Wald test based on the maximum likelihood estimator, this approximation
can be easily obtained by using our results on the asymptotic distribution of the
maximum likelihood estimator. We will consider, as usual, the case of a one-parameter
exponential family. The assumptions that we will make are the same as
those made when considering approximate critical values for likelihood ratio tests.


Theorem 4.26 (Approximate Critical Values for Wald Tests) Let $X_1,\dots,X_n$
be an iid sample from a distribution with density (or mass function) $f(x;\theta)$ which
belongs to a non-degenerate one-parameter exponential family,

$$f(x;\theta) = \exp\{\eta(\theta)T(x) - d(\theta) + S(x)\}, \qquad x\in\mathcal{X},\ \theta\in\Theta.$$

Assume that:

1. The parameter space $\Theta\subseteq\mathbb{R}$ is an open set.

2. The function $\eta(\cdot)$ is a twice continuously differentiable bijection between $\Theta$
and $\Phi = \eta(\Theta)$ with non-vanishing derivative.

Let $\hat\theta_n$ be the maximum likelihood estimator of $\theta$, and $\hat J_n = nJ(\hat\theta_n) =
n\,\frac{d''(\hat\theta_n)\eta'(\hat\theta_n) - d'(\hat\theta_n)\eta''(\hat\theta_n)}{\eta'(\hat\theta_n)}$. Take $\theta_0\in\Theta$ to be some fixed element of the parameter
space. Then,

$$\hat J_n(\hat\theta_n-\theta_0)^2 \overset{d}{\to} \chi^2_1,$$

whenever $\{H_0 : \theta = \theta_0\}$ is true.


Remark 4.27 (Approximate Critical Values for Wald Tests) The result can
now be used in order to determine the right critical value for a Wald test at level $\alpha$.
The Wald test function at level $\alpha$, say $1\{\hat J_n(\hat\theta_n-\theta_0)^2 > Q\}$, will be approximately
(for $n$ being large) equivalent to the test function

$$1\left\{\hat J_n(\hat\theta_n-\theta_0)^2 > \chi^2_{1,1-\alpha}\right\},$$

where $\chi^2_{1,1-\alpha}$ denotes the $(1-\alpha)$-quantile of the $\chi^2_1$ distribution. In other words, for
large $n$, the approximate critical value should be $Q \approx \chi^2_{1,1-\alpha}$.
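The likelihood-based Wald test with the approximate critical value of Remark 4.27 can be sketched in a few lines of Python. As an illustration we again assume an exponential model with rate $\lambda$ (our choice, not an example from the text): with $\eta(\lambda) = -\lambda$ and $d(\lambda) = -\log\lambda$ one gets $J(\lambda) = 1/\lambda^2$, hence $\hat J_n = n/\hat\lambda_n^2$. All function names are hypothetical.

```python
from statistics import NormalDist


def chi2_1_quantile(p):
    """p-quantile of chi-square(1), via the normal quantile: q = z_{(1+p)/2}^2."""
    return NormalDist().inv_cdf((1 + p) / 2) ** 2


def wald_statistic_exponential(xs, rate0):
    """J_n_hat * (rate_hat - rate0)^2 in the (assumed) Exp(rate) model.

    The MLE is rate_hat = n / sum(xs) and J_n_hat = n / rate_hat^2.
    """
    n = len(xs)
    rate_hat = n / sum(xs)
    return n * (rate_hat - rate0) ** 2 / rate_hat ** 2


def wald_rejects(xs, rate0, alpha):
    """Approximate level-alpha Wald test: reject when the statistic exceeds chi2_{1,1-alpha}."""
    return wald_statistic_exponential(xs, rate0) > chi2_1_quantile(1 - alpha)
```

Note that the only model-specific ingredients are the MLE and the plug-in quantity $\hat J_n$; the critical value is the same $\chi^2_1$ quantile for every model covered by Theorem 4.26.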


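Proof of Theorem 4.26 Under the conditions of the theorem, and when $\{H_0 : \theta =
\theta_0\}$ is true, we may invoke Corollary 3.27 (p. 83) to obtain

$$\sqrt{n}(\hat\theta_n-\theta_0) \overset{d}{\to} N\left(0, \frac{\eta'(\theta_0)}{d''(\theta_0)\eta'(\theta_0) - d'(\theta_0)\eta''(\theta_0)}\right). \tag{4.3}$$

Now we may calculate that

$$\frac{\hat J_n}{n} = \frac{d''(\hat\theta_n)\eta'(\hat\theta_n) - d'(\hat\theta_n)\eta''(\hat\theta_n)}{\eta'(\hat\theta_n)}.$$

By our smoothness assumptions on $\eta$ (and their ramifications on the smoothness of
$d$, see Remark 2.15, p. 53), the right-hand side above is a continuous function of $\hat\theta_n$.
Since $\hat\theta_n$ is consistent, we may apply the continuous mapping theorem (Theorem 2.25,
p. 57) to conclude that

$$\frac{\hat J_n}{n} \overset{p}{\to} \frac{d''(\theta_0)\eta'(\theta_0) - d'(\theta_0)\eta''(\theta_0)}{\eta'(\theta_0)}. \tag{4.4}$$

Combining (4.3) with (4.4), and using Slutsky's theorem (Theorem 2.26, p. 57) we
conclude that

$$\hat J_n^{1/2}(\hat\theta_n-\theta_0) = \left(\frac{\hat J_n}{n}\right)^{1/2}\sqrt{n}(\hat\theta_n-\theta_0) \overset{d}{\to} N(0,1).$$

Now we may take the square of the left-hand side, and use the continuous mapping
theorem (Theorem 2.25, p. 57) to conclude that

$$\left[\hat J_n^{1/2}(\hat\theta_n-\theta_0)\right]^2 = \hat J_n(\hat\theta_n-\theta_0)^2 \overset{d}{\to} \chi^2_1,$$

because we have seen that the square of a standard normal random variable has the
$\chi^2_1$ distribution (see Eq. (1.4), in Example 1.29, p. 20). □
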
Exercise 55 Let $X_1,\dots,X_n$ be an iid $N(0,\sigma^2)$ sample, where the variance $\sigma^2$ is
unknown. Construct an approximate Wald test (at level $\alpha$) for the hypothesis pair
$H_0 : \sigma^2 = \sigma_0^2$ vs $H_1 : \sigma^2 \neq \sigma_0^2$, for $\sigma_0^2 > 0$ fixed. Compare this test with the
corresponding likelihood ratio test.


Exercise 56 Let $X_1,\dots,X_n$ be iid Bernoulli random variables with unknown
parameter $p$. Construct an approximate Wald test (at level $\alpha$) for the hypothesis
pair $H_0 : p = p_0$ vs $H_1 : p \neq p_0$ for $p_0\in(0,1)$ fixed. Compare this test with the
corresponding likelihood ratio test.


4.4 The p-Value 


We saw that, in the Neyman-Pearson framework, we first need to select a significance
level $\alpha$, and then construct our testing procedure in a way that maximises
power, while preserving the level $\alpha$. This yields a reasonable mathematical theory
that can be considered to adequately address the hypothesis testing problem.

There are, nevertheless, two non-negligible weak points when it comes to
practical problems. They can be loosely stated as follows:

1. It is not always clear a priori what the "right" significance level is. Should we take
   $\alpha = 0.05$, or should we take $\alpha = 0.04$? It is the scientist who should suggest
   what the "right" significance level is, and then the mathematician gives the test
   function. But what if the scientist does not really know what the precise level
   should be, or if two different scientists suggest two different levels? This can be
   an issue because it might be that, for the same data, picking $\alpha = 0.05$ could
   result in $H_0$ being rejected, while picking $\alpha = 0.04$ could result in $H_0$ not being
   rejected.

2. Suppose we are somehow able to pick a precise level $\alpha$, so that we have bypassed
   the problem stated above. Once the level is set, we use the optimal test (if
   available), and then for our given data set we make a decision. Suppose we reject
   $H_0$ at the level $\alpha$. The problem now is that we have no clear indication of how
   comfortable or how marginal our decision was. For instance, would our decision
   have been different, had we selected a slightly smaller $\alpha$?

Fisher popularised an approach that can be thought of as the dual of the Neyman-Pearson
approach, and that provides a means to tackle these two issues. The idea
is that, instead of making a binary statement (i.e. $\delta = 0$ or $\delta = 1$), we define
a continuous measure of how strong the evidence in the data is against the null
hypothesis. This measure is called the p-value.


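Definition 4.28 (p-Value)
Let $X_1,\dots,X_n \overset{iid}{\sim} f(x;\theta)$ and $H_0 : \theta\in\Theta_0$ be a null hypothesis that is of one of
the three following forms:

$$\{H_0 : \theta = \theta_0\} \quad\text{or}\quad \{H_0 : \theta \le \theta_0\} \quad\text{or}\quad \{H_0 : \theta \ge \theta_0\}.$$

Let $\delta_\alpha$ be a test function for $H_0$, of one of the two following forms:

$$\delta_\alpha(X_1,\dots,X_n) = 1\{T(X_1,\dots,X_n) > q_{1-\alpha}\} \quad\text{or}\quad \delta_\alpha(X_1,\dots,X_n) = 1\{T(X_1,\dots,X_n) < q_\alpha\},$$

where $T$ is some test statistic, and $q_\alpha$ is the $\alpha$-quantile of the distribution $G_0(t) =
P_{\theta_0}[T(X_1,\dots,X_n) \le t]$. Then, we define

$$p(X_1,\dots,X_n) = \inf\{\alpha\in(0,1) : \delta_\alpha(X_1,\dots,X_n) = 1\}$$

to be the p-value.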


Remark 4.29 Notice that, in all the tests that we have seen, the test function
always reduces to one of the two forms mentioned in the definition above, though
sometimes perhaps approximately as $n\to\infty$.


In other words, the p-value is a random variable that tells us which is the
smallest significance level $\alpha$ for which our testing method would reject the null
hypothesis $H_0$ on the basis of the sample $X_1,\dots,X_n$. Why does this quantity have
any relevance? Because it gives us a measure of how stable our decision is under
perturbations of a given level $\alpha$: if the p-value is very small, then this means that
we reject $H_0$ even if we are very strict and impose a rather small $\alpha$ (i.e. very small
probability of type I error). If the p-value is relatively large, this means that we
would only have rejected $H_0$ if we were willing to tolerate a high probability of type
I error. How small should the p-value be in order to decide that we reject $H_0$?
The answer is left up to the scientist, who can decide depending on his/her deeper
knowledge of the experiment at hand. Notice that this approach gives a solution to
the problems (1) and (2) outlined above.

The definition of the p-value seems a bit complicated, and it is natural to wonder
whether it is possible to actually calculate it in concrete examples. This is indeed
the case, when the null hypothesis is of one of the forms we have considered thus
far; and, in fact, the calculation is quite easy:

Lemma 4.30 (Calculation of p-Values) In the setup given in Definition 4.28,
we have:

1. If $\delta_\alpha$ is of the form $\delta_\alpha(X_1,\dots,X_n) = 1\{T(X_1,\dots,X_n) > q_{1-\alpha}\}$, then

   $$p(X_1,\dots,X_n) = 1 - G_0(T(X_1,\dots,X_n)).$$

2. If $\delta_\alpha$ is of the form $\delta_\alpha(X_1,\dots,X_n) = 1\{T(X_1,\dots,X_n) < q_\alpha\}$, then

   $$p(X_1,\dots,X_n) = G_0(T(X_1,\dots,X_n)).$$

Remark 4.31 (Interpreting p-Values) The Lemma gives us a further way
of understanding p-values. Let's concentrate on case (1), where we reject for
large values of $T$. Notice that $1 - G_0(T(X_1,\dots,X_n))$ equals the probability of
observing something as large as, or even larger than, what we observed, when $H_0$ is
true. Therefore, when the p-value is small, we have in fact observed something that
would be very improbable/unusual if $H_0$ were indeed true. So we expect that $H_0$
is false. A common mistake is to interpret the p-value as the probability that $H_0$ is
true. This is wrong, and in fact does not even make sense, because the parameter $\theta$
is not a random variable.


Proof of Lemma 4.30 It suffices to prove (1), as (2) is proven directly analogously.
In the setting (1), we can use the fact that $G_0$ is non-decreasing to write:

$$\delta_\alpha(X_1,\dots,X_n) = 1 \iff T(X_1,\dots,X_n) > q_{1-\alpha} \iff G_0(T(X_1,\dots,X_n)) \ge G_0(q_{1-\alpha}).$$

It follows that $\inf\{\alpha\in(0,1) : \delta_\alpha(X_1,\dots,X_n) = 1\} = 1 - G_0(T(X_1,\dots,X_n))$,
and the proof is complete. □


Example 4.32
Let $X_1,\dots,X_n \overset{iid}{\sim} N(\mu,1)$ and consider the hypothesis pair:

$$H_0 : \mu = 0 \quad\text{vs}\quad H_1 : \mu \neq 0.$$

We recall (see Example 4.21, p. 115) that the likelihood ratio test for this pair is given by:

$$\delta(X_1,\dots,X_n) = 1\left\{\left(\frac{\bar X}{1/\sqrt{n}}\right)^2 > \chi^2_{1,1-\alpha}\right\},$$

where $\chi^2_{1,1-\alpha}$ is the $1-\alpha$ quantile of the $\chi^2_1$ distribution. Notice, therefore, that this test
conforms to the setup given in Definition 4.28. We may thus define the corresponding p-value as

$$p(X_1,\dots,X_n) = 1 - G_{\chi^2_1}(n\bar X^2),$$

where $G_{\chi^2_1}$ denotes the CDF of the $\chi^2_1$ distribution. Observe that when $\bar X$ is at a large distance from
0, then the p-value will be small. In fact, the p-value is monotonically decreasing in $|\bar X|$ (note that
$G_{\chi^2_1}$ is a monotonically increasing function from $(0,\infty)$ to $(0,1)$ because the density of a $\chi^2_1$ is
strictly positive over the entire interval $(0,\infty)$; see Definition 1.16, p. 13). □
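The p-value of Example 4.32 is easy to compute numerically. The sketch below uses the identity $1 - G_{\chi^2_1}(s) = P(|Z| > \sqrt{s}) = \operatorname{erfc}(\sqrt{s/2})$ for $Z \sim N(0,1)$, so that only the Python standard library is needed; the function name `p_value` is an illustrative assumption.

```python
import math


def p_value(xs):
    """p-value 1 - G(n * xbar^2) for H0: mu = 0 in the N(mu, 1) model.

    G is the chi-square(1) CDF, and 1 - G(s) = erfc(sqrt(s / 2)).
    """
    n = len(xs)
    xbar = sum(xs) / n
    return math.erfc(math.sqrt(n * xbar ** 2 / 2))


# A sample of size 100 whose mean is 0.196 gives n * xbar^2 close to 1.96^2,
# i.e. a p-value close to 0.05.
print(p_value([0.196] * 100))
```

Using `erfc` rather than `1 - erf` keeps the computation accurate when the p-value is very small.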


One might finally ask: is there any link between Fisher's and Neyman &
Pearson's approach to hypothesis tests? In the case where $G_0(t)$ is strictly monotonic,³
there is a particularly simple and elegant connection:


Corollary 4.33 In the setup given in Definition 4.28, let $\alpha_0\in(0,1)$ and assume
that $G_0$ is continuous and strictly increasing. If we define a test function

$$\psi(X_1,\dots,X_n) = 1\{p(X_1,\dots,X_n) < \alpha_0\},$$

then $\psi(X_1,\dots,X_n) = \delta_{\alpha_0}(X_1,\dots,X_n)$. In other words, if we reject the null
whenever the p-value is smaller than $\alpha_0$, then our test reduces to $\delta_{\alpha_0}$.


Proof Without loss of generality, we assume that we are in the setup
where the p-value corresponds to a test function of the form $\delta_\alpha(X_1,\dots,X_n) :=
1\{T(X_1,\dots,X_n) > q_{1-\alpha}\}$. Now, observe that, using Lemma 4.30, we have:

$$p(X_1,\dots,X_n) < \alpha_0 \iff 1 - G_0(T(X_1,\dots,X_n)) < \alpha_0 \iff G_0(T(X_1,\dots,X_n)) > 1-\alpha_0.$$

Under our assumptions, $G_0^{-1}$ exists and is strictly increasing. Applying it to both
sides of the last inequality yields:

$$p(X_1,\dots,X_n) < \alpha_0 \iff T(X_1,\dots,X_n) > \underbrace{G_0^{-1}(1-\alpha_0)}_{=\,q_{1-\alpha_0}} \iff \delta_{\alpha_0}(X_1,\dots,X_n) = 1.$$

□
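The equivalence in Corollary 4.33 can be checked numerically in the setting of Example 4.32, where $G_0$ is the $\chi^2_1$ CDF. The Python sketch below compares the two decisions over a grid of values of the test statistic; the names `delta` and `psi` mirror the notation of the corollary, while the grid and the level $\alpha_0 = 0.05$ are arbitrary choices made for illustration.

```python
import math
from statistics import NormalDist


def p_value(t):
    """p = 1 - G0(t), where G0 is the chi-square(1) CDF."""
    return math.erfc(math.sqrt(t / 2))


def delta(t, alpha):
    """Neyman-Pearson style test function 1{T > q_{1-alpha}}."""
    q = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # (1 - alpha)-quantile of chi2_1
    return 1 if t > q else 0


def psi(t, alpha0):
    """Test function that rejects whenever the p-value is below alpha0."""
    return 1 if p_value(t) < alpha0 else 0


alpha0 = 0.05
grid = [0.1 * k for k in range(0, 101)]  # values of T between 0 and 10
agree = all(psi(t, alpha0) == delta(t, alpha0) for t in grid)
print(agree)
```

The two rules only differ in how the same comparison is expressed: one compares $T$ with a quantile, the other compares $1 - G_0(T)$ with the level.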


It follows that the p-value is a versatile tool: reporting a p-value solves some of the
problems that we mentioned earlier in this section. Moreover, even when we report a
p-value, we can still use it to implement a Neyman-Pearson type test at some level
$\alpha$, simply by rejecting whenever $p < \alpha$.


³ This is not as restrictive as it may sound. A sufficient condition is that the distribution must be of
the continuous type, with a probability density function satisfying $g_0(t) > 0$ for all $t$. This will be
true, for example, whenever $G_0$ is a CDF corresponding to a normal, Student, or exponential family
distribution. Furthermore, in many examples, we can approximate $G_0$ for large $n$ by the normal
CDF, so the assumption is again approximately satisfied, even if the exact form of $G_0$ is discrete.

Exercise 57 Let $X_1,\dots,X_n \overset{iid}{\sim} f(x;\theta)$. Suppose we wish to test $H_0 : \theta = \theta_0$ vs
$H_1 : \theta \neq \theta_0$ using the test function $\delta_\alpha$ of the form

$$\delta_\alpha(T(X_1,\dots,X_n)) = 1\{T(X_1,\dots,X_n) > q_{1-\alpha}\} \quad\text{or}\quad \delta_\alpha(T(X_1,\dots,X_n)) = 1\{T(X_1,\dots,X_n) < q_\alpha\},$$

where $q_\alpha$ is the $\alpha$-quantile of $G_0$, the CDF of $T(X_1,\dots,X_n)$ when $\theta = \theta_0$.
Assuming that $G_0$ is continuous, show that, under $H_0$, the p-value is uniformly
distributed on [0, 1].
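Before attempting the proof, the claim of Exercise 57 can be explored empirically. The following Python sketch simulates many samples under $H_0 : \mu = 0$ in the $N(\mu,1)$ model of Example 4.32, computes the p-value of each, and compares the empirical distribution of the p-values with the uniform one. The sample size, number of replications, and seed are arbitrary assumptions; a simulation of this kind is only a plausibility check, not a substitute for the proof.

```python
import math
import random


def p_value(xs):
    """p-value 1 - G(n * xbar^2) for H0: mu = 0, with G the chi-square(1) CDF."""
    n = len(xs)
    xbar = sum(xs) / n
    return math.erfc(math.sqrt(n * xbar ** 2 / 2))


def simulate_p_values(n, reps, seed=0):
    """Draw `reps` samples of size n from N(0, 1), i.e. under H0, and return their p-values."""
    rng = random.Random(seed)
    return [p_value([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(reps)]


ps = simulate_p_values(n=20, reps=4000)
# If p ~ U[0, 1] under H0, then P(p <= a) = a for every a in [0, 1].
frac_below = {a: sum(p <= a for p in ps) / len(ps) for a in (0.05, 0.25, 0.5, 0.75)}
print(frac_below)
```

Each printed fraction should be close to its key, up to Monte Carlo error of order $1/\sqrt{4000}$.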


4.5 On Terminology: Accepting Versus Not Rejecting


From the mathematical perspective, the outcome of a hypothesis test is clear cut:
0 or 1. This means that we decide between the two competing hypotheses, $H_0$ and
$H_1$. How do we communicate this decision in the context of an application?

In the context of science, competing hypotheses represent competing scientific 
theories. The null hypothesis represents a scientific assertion. The alternative 
hypothesis encapsulates how we might expect the assertion to break down. 

When the outcome of the test is 0, then the empirical evidence is not sufficient
in order to reject the null hypothesis. Does this mean that the evidence actually
proves that $H_0$ is true? No, it merely does not disprove $H_0$. For this reason, when
the outcome is 0, we say that "we do not reject the null hypothesis $H_0$" instead
of saying "we accept the null hypothesis $H_0$". From a mathematical perspective, we
can think of this in the context of necessary and sufficient conditions. If the evidence
is such that $\delta = 0$, then a necessary condition for $H_0$ to hold true (namely, the data being
consistent with $H_0$) is not violated. This does not prove validity of $H_0$, it merely
says that we cannot disprove validity of $H_0$ given the current data set.

On the other hand, when the test results in "1", the interpretation is that the
evidence does not support the null hypothesis: the data appear to be incompatible
with $H_0$ (we have something like a counterexample). So, we can say that "we
reject the null". But can we actually say that "we accept the alternative"? The
alternative was used as a device in order to detect possible departures from the null,
by constructing a test function that would be able to detect departures in the direction
of the alternative. It was our "best devil's advocate", but it was not necessarily the
most viable alternative scientific theory in itself. For this reason, in the context
of scientific applications, when $\delta = 1$, we almost always say "we reject the null
hypothesis $H_0$", instead of saying "we accept the alternative hypothesis $H_1$".

Again, from a mathematical standpoint, things are clear: we decide 0 or 1. But
when communicating a mathematical result to scientists, there are pitfalls due to the
weaknesses of the verbal presentation of otherwise rigorous mathematical results.
The language of mathematics is clear, but the verbal presentation of mathematics
will always be less rigorous, and care must be taken.

In summary, the next table presents the recommended way of verbally conveying
the result of a hypothesis test:

Mathematical statement    $\delta(X_1,\dots,X_n) = 1$        $\delta(X_1,\dots,X_n) = 0$
Verbal statement          We reject the null hypothesis      We do not reject the null hypothesis

Exercise 58 As an example of a situation where one must be careful in phrasing
the result of a test, we consider a more complicated scenario. Let $(X,Y)$ be a
random vector taking values in $\{1,2\}^2$. Let $(X_1,Y_1),\dots,(X_n,Y_n)$ be an iid random
sample distributed as $(X,Y)$. We wish to test the hypothesis that $X$ and $Y$ are
independent random variables. Let $p_1 = P(X = 1)$, $p_2 = P(Y = 1)$ and
$p_3 = P(X = Y = 1)$.

1. Formulate the null and alternative hypotheses in terms of $p_1$, $p_2$ and $p_3$.

2. Find the maximum likelihood estimators of $p_1$, $p_2$ and $p_3$ on the basis of
   the sample values $(x_1,y_1),\dots,(x_n,y_n)$ in general, as well as when the null
   hypothesis is valid.

3. Show that if $p_1 = p_2 = 1/2$ are known, we have a one-parameter exponential
   family. Test the independence hypothesis in this case, and find the approximate
   p-value for the following data: $n = 1024$, $n_{11} = 266$, $n_{12} = 231$, $n_{21} = 243$,
   $n_{22} = 284$, where $n_{ij}$ is the number of indices $k$ such that $X_k = i$ and $Y_k = j$.

Remark: There exists a test for the more general case when $p_1$, $p_2$, $p_3$ are unknown,
where the limiting distribution of the test statistic is $\chi^2_1$, but we do not yet have the
tools to consider this case rigorously. The test applies also when $X$ takes $k > 1$
different values, and $Y$ takes $l > 1$ different values. The limiting distribution will
be $\chi^2_{(k-1)(l-1)}$ in this case.


Confidence Intervals for Model Parameters 


Once again, let us zoom out to see the bigger picture: there is a regular parametric
family of distributions $\mathcal{F} = \{F_\theta : \theta\in\Theta\}$, where $\Theta\subseteq\mathbb{R}$, which is our model
for a certain stochastic phenomenon. We are able to observe $n$ independent and
identically distributed outcomes from the phenomenon, say $X_1,\dots,X_n \overset{iid}{\sim} F_\theta$,
generated for a particular choice of $\theta\in\Theta\subseteq\mathbb{R}$; but the precise value $\theta\in\Theta$
that generated them (the true state of nature) is unknown to us. With this iid sample
at our disposal, we wish to make inferences about $\theta$. So far we have made two kinds
of inferences on the true parameter value:

1. Point Estimation. Find the exact value of the unknown parameter $\theta$, as accurately
   as possible.

2. Hypothesis Testing. Given two candidate regions $\Theta_1$ and $\Theta_0$ where $\theta$ might lie,
   find optimal ways of deciding in which of the two regions the true $\theta$ resides.

In this chapter, we will consider the third important problem of statistical inference,
which loosely stated is:

3. Interval Estimation. Find an interval of plausible values for $\theta$, in the sense that
   the interval has a high probability of containing $\theta$.

The essence of the third problem is as follows. We know that an estimator
$\hat\theta(X_1,\dots,X_n)$ of $\theta$ is a random variable. Therefore, the probability that $\hat\theta$ perfectly
estimates $\theta$ is either low (if $\hat\theta$ is a discrete random variable) or even zero (if $\hat\theta$ is a
continuous random variable). However, if $\hat\theta$ is an estimator with a low mean squared
error, then we expect that $\theta$ cannot be very far from our estimate $\hat\theta(X_1,\dots,X_n)$. Can
we use our estimator $\hat\theta$ and (approximate) knowledge of its sampling distribution in
order to propose an interval that is highly likely to contain the true $\theta$? Such an
interval we call a confidence interval.

In the next few paragraphs we will define the notion of a confidence interval
rigorously, and we will show how we can use our knowledge of point estimation
theory in order to construct such intervals. We will then consider the problem of how
to define "optimal intervals". To do this, we will use an important duality between
interval estimation and hypothesis testing.¹


© Springer International Publishing Switzerland 2016
V.M. Panaretos, Statistics for Mathematicians, Compact Textbooks in Mathematics,
DOI 10.1007/978-3-319-28341-8_5


5.1 Confidence Intervals and Confidence Levels 


Let us begin with the rigorous definition of a confidence interval, and then discuss 
its elements. 


Definition 5.1 (Two-Sided Confidence Interval)
Let $X_1,\dots,X_n \overset{iid}{\sim} f(x;\theta)$, where $\theta\in\Theta\subseteq\mathbb{R}$, be a random sample and $\alpha\in(0,1)$
be a constant. Let $L(X_1,\dots,X_n)$ and $U(X_1,\dots,X_n)$ be two statistics, called the
lower limit and upper limit, respectively, such that

$$\inf_{\theta\in\Theta} P_\theta\left[L(X_1,\dots,X_n) \le \theta \le U(X_1,\dots,X_n)\right] = 1-\alpha.$$

Then, the random interval

$$\left[L(X_1,\dots,X_n),\ U(X_1,\dots,X_n)\right]$$

is called a two-sided confidence interval for $\theta$ with confidence level $(1-\alpha)$.


Since anything we do will depend on our sample $X_1,\dots,X_n$, any candidate
interval we propose will in fact be a random interval that will take different values
for different realisations of our sample. In order to be able to construct this random
interval, its endpoints $L$ and $U$ will be statistics constructed from our sample.

For the interval to truly be a likely region for the true parameter $\theta$, we ask that
the probability of the event $\{L \le \theta \le U\}$ be at least as large as $1-\alpha$, whatever the
true value of $\theta$ may be,² for some small probability $\alpha$. There are situations where we
are more interested in giving a lower or upper confidence bound on the true value
of a parameter $\theta$. In these cases, instead of using a two-sided confidence interval as
defined in Definition 5.1, we use the notion of a one-sided interval.


Definition 5.2 (One-Sided Confidence Interval)
Let $X_1,\dots,X_n \overset{iid}{\sim} f(x;\theta)$, where $\theta\in\Theta\subseteq\mathbb{R}$, be a random sample and $\alpha\in(0,1)$
be a constant. Let $L(X_1,\dots,X_n)$ be a statistic such that

$$\inf_{\theta\in\Theta} P_\theta\left[L(X_1,\dots,X_n) \le \theta\right] = 1-\alpha.$$

Then, the random interval

$$\left[L(X_1,\dots,X_n),\ +\infty\right)$$

is called a left-sided confidence interval for $\theta$ with confidence level $(1-\alpha)$.
Analogously, if $U(X_1,\dots,X_n)$ is a statistic such that

$$\inf_{\theta\in\Theta} P_\theta\left[U(X_1,\dots,X_n) \ge \theta\right] = 1-\alpha,$$

then the random interval

$$\left(-\infty,\ U(X_1,\dots,X_n)\right]$$

is called a right-sided confidence interval for $\theta$ with confidence level $(1-\alpha)$.


¹ Note that the problem "use the data to decide if the region $\Theta_0$ contains $\theta$" is in some sense dual
to the question "use the data to find a region that is highly likely to contain $\theta$".

² Since this probability obviously depends on the true value of $\theta$!


We now illustrate many essential features of confidence intervals within the 
framework of the following prototypical example. 


Example 5.3 (Confidence Interval for the Mean of a Normal Distribution)

Let $X_1,\dots,X_n \overset{iid}{\sim} N(\mu,\sigma^2)$, where $\mu$ is unknown and $\sigma^2$ is known. We wish to
construct a two-sided interval for $\mu$. We begin by observing that by Lemma (1.32, p. 22) we have:

$$\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1).$$

Therefore, if $z_{\alpha/2}$ and $z_{1-\alpha/2}$ are the $\alpha/2$ and $1-\alpha/2$ quantiles (respectively) of the
$N(0,1)$ distribution, we must have:

$$P\left[z_{\alpha/2} \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le z_{1-\alpha/2}\right] = 1-\alpha
\iff
P\left[\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right] = 1-\alpha.$$

The above equality is true whatever the true value of $\mu \in \mathbb{R}$ may be. It follows that if we set

$$L(X_1,\dots,X_n) = \bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}
\quad\text{and}\quad
U(X_1,\dots,X_n) = \bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},$$

then the interval $[L, U]$ is a confidence interval with confidence level $1-\alpha$. Because the density
of an $N(0,1)$ distribution is symmetric, we have that $z_{\alpha/2} = -z_{1-\alpha/2}$. So our $(1-\alpha)$-confidence
interval may be written as

$$\left[\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right]. \qquad (5.1)$$

For brevity, we sometimes represent the endpoints of the interval as $\bar{X} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$. Notice that
the confidence interval is centred around the maximum likelihood estimator of $\mu$. It says that our
plausible region is the MLE, plus or minus a constant times the standard deviation of the MLE
(since $\sigma^2/n$ is the variance of the MLE $\bar{X}$). The constant is chosen in order to have confidence
level $1-\alpha$.

We can also make some more observations. The length of the confidence interval (equal to
$2 z_{1-\alpha/2}\,\sigma/\sqrt{n}$) depends on $\sigma^2$, $n$ and $\alpha$. The parameter $\sigma^2$ is beyond our control, since it is the
variance of the underlying $N(\mu,\sigma^2)$ distribution. The two parameters that we are able to control
are the sample size $n$ and the confidence level $1-\alpha$. Increasing $n$ re-scales the length by $1/\sqrt{n}$.
So, for example, if we want to make the interval ten times shorter, we need to take a sample size
that is 100 times larger. On the other hand, decreasing $\alpha$ (increasing the confidence $1-\alpha$) increases
the length of the interval: the more confident we want to be in our interval, the longer the interval
will be (notice that the length of the interval tends to $\infty$ as $\alpha \to 0$).
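
The two observations above can be checked numerically. The following is a minimal sketch, not part of the original text: the function names `normal_mean_interval` and `length`, as well as the numbers fed to them, are made up for illustration. It evaluates the interval (5.1) with the standard normal quantile from Python's standard library and compares lengths.

```python
from math import sqrt
from statistics import NormalDist


def normal_mean_interval(xbar, sigma, n, alpha):
    """Two-sided (1 - alpha) interval (5.1) for the mean, with sigma known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


def length(interval):
    lower, upper = interval
    return upper - lower


# Hypothetical inputs: sample mean 0, sigma 1.
base = normal_mean_interval(xbar=0.0, sigma=1.0, n=25, alpha=0.05)
bigger_n = normal_mean_interval(xbar=0.0, sigma=1.0, n=2500, alpha=0.05)
more_confident = normal_mean_interval(xbar=0.0, sigma=1.0, n=25, alpha=0.01)

# A sample 100 times larger gives an interval about 10 times shorter.
print(length(base) / length(bigger_n))
# A smaller alpha (more confidence) gives a longer interval.
print(length(more_confident) > length(base))
```

The only ingredient beyond arithmetic is the quantile $z_{1-\alpha/2}$, which is why the length depends on the data only through $n$ when $\sigma$ is known.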

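We may also ask how to construct one-sided confidence intervals, in case we are interested in
lower or upper bounds for the parameter $\mu$. Let us consider the problem of finding a right-sided
confidence interval. Using the fact that $\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$, we may write

$$P\left[\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \ge z_{\alpha}\right] = 1-\alpha.$$

This can be manipulated to yield

$$P\left[\mu \le \bar{X} - z_{\alpha}\frac{\sigma}{\sqrt{n}}\right] = P\left[\mu \le \bar{X} + z_{1-\alpha}\frac{\sigma}{\sqrt{n}}\right] = 1-\alpha,$$

and so the interval

$$\left(-\infty,\ \bar{X} + z_{1-\alpha}\frac{\sigma}{\sqrt{n}}\right]$$

is a right-sided confidence interval with confidence level $1-\alpha$. Similarly, we can show that a
left-sided $(1-\alpha)$-confidence interval is given by

$$\left[\bar{X} - z_{1-\alpha}\frac{\sigma}{\sqrt{n}},\ +\infty\right).$$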


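In summary:

Confidence $1-\alpha$    $L(X_1,\dots,X_n)$                              $U(X_1,\dots,X_n)$
Two-sided                $\bar{X} - z_{1-\alpha/2}\,\sigma/\sqrt{n}$     $\bar{X} + z_{1-\alpha/2}\,\sigma/\sqrt{n}$
Left-sided               $\bar{X} - z_{1-\alpha}\,\sigma/\sqrt{n}$       $+\infty$
Right-sided              $-\infty$                                       $\bar{X} + z_{1-\alpha}\,\sigma/\sqrt{n}$

□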
Exercise 59 (Normal Case, Unknown Variance) Let $X_1,\dots,X_n \overset{iid}{\sim} N(\mu,\sigma^2)$,
where both $\mu$ and $\sigma^2$ are unknown. Let $S^2 = \sum_{i=1}^{n}(X_i-\bar{X})^2/(n-1)$, and $t_{k,\alpha}$ be
the $\alpha$-quantile of Student's $t_k$ distribution (with $k$ degrees of freedom). Prove that
the confidence intervals given by the following table are $(1-\alpha)$-confidence intervals
for the mean $\mu$.

Confidence $1-\alpha$    $L(X_1,\dots,X_n)$                           $U(X_1,\dots,X_n)$
Two-sided                $\bar{X} - t_{n-1,1-\alpha/2}\,S/\sqrt{n}$   $\bar{X} + t_{n-1,1-\alpha/2}\,S/\sqrt{n}$
Left-sided               $\bar{X} - t_{n-1,1-\alpha}\,S/\sqrt{n}$     $+\infty$
Right-sided              $-\infty$                                    $\bar{X} + t_{n-1,1-\alpha}\,S/\sqrt{n}$

Exercise 60 (Optimal Choice of Quantiles) In order to construct the two-sided
confidence interval for the mean of a normal distribution (known variance) in
Example 5.3, we chose $z_{\alpha/2}$ and $z_{1-\alpha/2}$ as the quantiles to base the interval on.
One can wonder why not choose $z_{\alpha/3}$ and $z_{1-2\alpha/3}$, for example. It's true that a more
natural choice of interval is a symmetric interval, but here is a further reason why:

1. Let $Z \sim N(0,1)$ and $\alpha \in (0,1)$. Show that the interval $I = [L, U]$ of minimal
   length such that $P[I \ni Z] \ge 1-\alpha$ is given by the choice $L = z_{\alpha/2}$ and
   $U = z_{1-\alpha/2}$.

2. Let $X_1,\dots,X_n \overset{iid}{\sim} N(\mu,\sigma^2)$ where $\sigma^2$ is known. Find the interval $I_n = [A_n, B_n]$
   of smallest length such that $P[I_n \ni \mu] \ge 1-\alpha$.

3. Can we generalise this result to the case of unknown variance? Or even to
   distributions other than the normal distribution?

Exercise 61 (Difference Between Means) Let $X_1,\dots,X_n \overset{iid}{\sim} N(\mu_X,\sigma^2)$ and
$Y_1,\dots,Y_n \overset{iid}{\sim} N(\mu_Y,\sigma^2)$ be two independent samples, where $\mu_X$, $\mu_Y$ and $\sigma^2$ are all
unknown. Construct a two-sided confidence interval for the parameter $\delta = \mu_X - \mu_Y$
with confidence level $1-\alpha$.


5.2 Pivots and Approximate Pivots 


It seems that the construction of confidence intervals is quite straightforward and 
indeed transparent in the case of the mean parameter of a normal distribution. 
However, it also seems that the way we proceeded in our construction was rather 
ad-hoc, and indeed specific to that particular case. How does this example make us 
any wiser in terms of constructing confidence intervals in more general situations? 
We need to find general methods of constructing such intervals. The crucial step in 
Example 5.3 was exploiting the fact that 

$$\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1).$$

This allowed us to write the probability statement

$$P\left[z_{\alpha/2} \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le z_{1-\alpha/2}\right] = 1-\alpha,$$

which was valid for any value of $\mu$. We were then able to manipulate the argument of
the probability to get our interval. The reason this worked was that $\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$ constitutes
what we call a pivot.


Definition 5.4 (Pivot)
Let $X_1,\dots,X_n \overset{iid}{\sim} F(x;\theta)$. A function

$$g : \mathcal{X}^n \times \Theta \to \mathbb{R}$$

is called a pivot if:
1. $\theta \mapsto g(x_1,\dots,x_n,\theta)$ is continuous for all $(x_1,\dots,x_n) \in \mathcal{X}^n$.
2. $P_\theta[g(X_1,\dots,X_n,\theta) \le x]$ does not depend on $\theta$.


Remark 5.5 In other words, a pivot $g(X_1,\dots,X_n,\theta)$ is a function of the sample
and the parameter, but its distribution is not a function of the parameter. Notice that,
by its very definition, a pivot is not a statistic: it depends on the unknown parameter!
The continuity requirement will become clear soon.

If we are able to find a pivot for $\theta$, whose distribution is known, then we are able
to find quantiles $q_1$ and $q_2$ such that

$$P[q_1 \le g(X_1,\dots,X_n,\theta) \le q_2] = 1-\alpha.$$

If $g$ is of a form that allows us to manipulate the inequality inside the probability
(similarly to Example 5.3), then we are able to obtain an explicit confidence interval.
Still, though, even if we cannot manipulate the expression, we can numerically try
to determine the set

$$\{\theta \in \Theta : q_1 \le g(X_1,\dots,X_n,\theta) \le q_2\}$$

and retain this set as our confidence interval. Notice that under our continuity
assumption (1) on $g$, this set may be an interval or a union of intervals depending
on the behaviour of $g$. A sufficient condition to obtain a single interval is to ask that
$g$ be monotone in $\theta$. But this is not a necessary condition, of course. In practice, the
pivots that we will encounter will typically give us intervals rather than unions of
intervals.
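
The numerical route just described can be sketched as a simple grid scan. This is an illustrative sketch only: the pivot used is the one from Example 5.3 (so that the answer can be compared with the closed form), and the function names, the search range and the inputs are hypothetical choices, not taken from the text.

```python
from math import sqrt
from statistics import NormalDist


def pivot(sample_mean, n, sigma, theta):
    """Pivot g = (sample mean - theta) / (sigma / sqrt(n)); N(0, 1) for every theta."""
    return (sample_mean - theta) / (sigma / sqrt(n))


def confidence_set(sample_mean, n, sigma, alpha, grid):
    """Grid points theta satisfying q1 <= g(x, theta) <= q2."""
    q1 = NormalDist().inv_cdf(alpha / 2)
    q2 = NormalDist().inv_cdf(1 - alpha / 2)
    return [t for t in grid if q1 <= pivot(sample_mean, n, sigma, t) <= q2]


# Hypothetical search range [-3, 3] with step 0.001.
grid = [i / 1000 for i in range(-3000, 3001)]
kept = confidence_set(sample_mean=0.4, n=16, sigma=1.0, alpha=0.05, grid=grid)

# Because this pivot is monotone in theta, the retained grid points are contiguous,
# so the smallest and largest of them approximate the interval endpoints.
print(min(kept), max(kept))
```

For a pivot that is not monotone in $\theta$, the same scan would still return the retained points, but they could form several separate runs, matching the remark that the set may be a union of intervals.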

Once we have a pivot whose distribution is known, then we are able to construct 
confidence intervals. However, there are two challenges that we now face: 

1. How can we find pivots in general? 
2. How can we determine the distribution of a pivot? 

The determination of a pivot (and its distribution) depends upon the particular 
probability distribution, and also on which parameter of the distribution we wish 
to construct a confidence interval for. Thus, there is no single “explicit formula”,
and pivots are constructed on a case-by-case basis. Nevertheless, it turns out that
we can often answer both questions (1) and (2) with a general “explicit formula”
by settling for what is called an approximate pivot. This means that it may not be a
pivot for any finite $n$, but it gradually satisfies the assumptions of a pivot as $n \to \infty$.


Definition 5.6 (Approximate Pivot)
Let $X_1,\dots,X_n \overset{iid}{\sim} F(x;\theta)$. A function

$$g : \mathcal{X}^n \times \Theta \to \mathbb{R}$$

is called an approximate pivot if:
1. For all $n \in \mathbb{N}$, $\theta \mapsto g(x_1,\dots,x_n,\theta)$ is continuous for all $(x_1,\dots,x_n) \in \mathcal{X}^n$.
2. We have

$$g(X_1,\dots,X_n,\theta) \xrightarrow{d} Y \quad \text{as } n \to \infty,$$

where $Y$ is a random variable whose distribution does not depend on $\theta$.

If we know the asymptotic distribution of an approximate pivot, we may
construct an approximate confidence interval. How? Assume that $Y$ is a continuous
random variable. If we take $q_1$ and $q_2$ to be quantiles of $F_Y$ such that

$$P[q_1 \le Y \le q_2] = 1-\alpha,$$

then we have

$$g(X_1,\dots,X_n,\theta) \xrightarrow{d} Y \implies P[q_1 \le g(X_1,\dots,X_n,\theta) \le q_2] \xrightarrow{n\to\infty} 1-\alpha.$$

We can therefore use the approximate pivot in order to build an approximate
confidence interval.


Example 5.7 (Mean of a General Distribution)

Let $X_1,\dots,X_n$ be an iid collection of random variables with unknown mean $\mu = E[X_i]$ and
unknown variance $E[(X_i-\mu)^2] = \sigma^2 < \infty$. Suppose we wish to find an approximate pivot in
order to construct a $(1-\alpha)$-confidence interval for $\mu$. We remark that:

* By the central limit theorem (Theorem 2.23, p. 56), we have $\sqrt{n}(\bar{X}-\mu) \xrightarrow{d} N(0,\sigma^2)$.
* By the strong law of large numbers (see Remark 2.22, p. 56), $S^2 = \sum_{i=1}^{n}(X_i-\bar{X})^2/(n-1) \xrightarrow{a.s.} \sigma^2$.
  Indeed, $U^2 = \sum_{i=1}^{n}(X_i-\mu)^2/(n-1) \xrightarrow{a.s.} \sigma^2$ and $U^2 - S^2 = n(n-1)^{-1}(\bar{X}-\mu)^2 \xrightarrow{a.s.} 0$.

Combining the two facts provided above, we may use Slutsky's theorem (Theorem 2.26, p. 57) to
conclude that

$$g(X_1,\dots,X_n,\mu) = \frac{\bar{X}-\mu}{S/\sqrt{n}} \xrightarrow{d} N(0,1),$$

so that we have found an approximate pivot. Mimicking the manipulations carried out in
Example (5.3, p. 133), we have that:

$$P\left[\bar{X} - z_{1-\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} - z_{\alpha/2}\frac{S}{\sqrt{n}}\right]
 = P\left[z_{\alpha/2} \le \frac{\bar{X}-\mu}{S/\sqrt{n}} \le z_{1-\alpha/2}\right]$$
$$= P\left[z_{\alpha/2} \le g(X_1,\dots,X_n,\mu) \le z_{1-\alpha/2}\right]$$
$$\xrightarrow{n\to\infty} P\left[z_{\alpha/2} \le Y \le z_{1-\alpha/2}\right] = 1-\alpha,$$

where $Y \sim N(0,1)$. It follows that the interval $\bar{X} \pm z_{1-\alpha/2}\,S/\sqrt{n}$ is approximately, for large $n$,
a two-sided $(1-\alpha)$-confidence interval for $\mu$. By similar arguments, we may construct one-sided
confidence intervals. The results can be summarised in the following table:

Approximate Confidence $1-\alpha$    $L(X_1,\dots,X_n)$                      $U(X_1,\dots,X_n)$
Two-sided                            $\bar{X} - z_{1-\alpha/2}\,S/\sqrt{n}$  $\bar{X} + z_{1-\alpha/2}\,S/\sqrt{n}$
Left-sided                           $\bar{X} - z_{1-\alpha}\,S/\sqrt{n}$    $+\infty$
Right-sided                          $-\infty$                               $\bar{X} + z_{1-\alpha}\,S/\sqrt{n}$
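
To see what “approximately, for large $n$” can mean in practice, one may estimate the coverage of the two-sided interval by simulation. The sketch below is an assumption-laden illustration rather than part of the text: it draws data from an exponential distribution with mean 1 (a hypothetical choice of non-normal model), and the names `approx_interval` and `coverage`, the seed and the number of replications are all arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist


def approx_interval(sample, alpha):
    """Approximate two-sided interval: sample mean +/- z_{1-alpha/2} * S / sqrt(n)."""
    n = len(sample)
    centre = sum(sample) / n
    s2 = sum((x - centre) ** 2 for x in sample) / (n - 1)   # S^2 with the (n - 1) divisor
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(s2) / sqrt(n)
    return centre - half_width, centre + half_width


def coverage(n, alpha, replications, rng):
    """Fraction of simulated intervals that cover the true mean (here 1.0)."""
    true_mean = 1.0
    hits = 0
    for _ in range(replications):
        sample = [rng.expovariate(1.0) for _ in range(n)]   # Exp(1) has mean 1
        lower, upper = approx_interval(sample, alpha)
        hits += lower <= true_mean <= upper
    return hits / replications


rng = random.Random(0)   # fixed seed, purely for reproducibility
for n in (10, 100, 1000):
    print(n, coverage(n, alpha=0.05, replications=1000, rng=rng))
```

The printed fractions can be compared with the nominal level 0.95; the asymptotic argument only guarantees that they approach it as $n$ grows, and says nothing about how close they are for a small sample.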


Of course, in general we will be interested in parameters other than just the mean, 
so this example is rather special. In the next section we shall consider two ways of 
constructing approximate pivots in one-parameter exponential families. 


Exercise 62 Combine the reasoning in Example 5.7 and Example 5.3 (p. 133) to
show that if $T_k \sim t_k$, then $T_k \xrightarrow{d} Z$ as $k \to \infty$, where $Z \sim N(0,1)$.


5.2.1 Approximate Pivots in Exponential Families 


We have seen thus far that both point estimation and hypothesis testing have some 
very attractive properties when considering one-parameter exponential families. The 
problem of interval estimation is no exception. We will see in this paragraph that it 
is feasible to find approximate pivots for one-parameter exponential families under 
very mild conditions. We will consider two types of confidence intervals arising 
from two types of pivots: 

1. Wald intervals. 

2. Likelihood ratio intervals. 

Notice that the names of these two methods highly resemble two methods we 
saw for constructing hypothesis tests. This is no accident, and we will rigorously 
investigate this connection in Sect. 5.3 (p. 141). For the moment, we determine the 
approximate pivots. 


5.2.1.1 Wald Pivots 


Proposition 5.8 (Wald Approximate Pivots) Let $X_1,\dots,X_n$ be an iid sample
from a distribution with density (or mass function) $f(x;\theta)$ which belongs to a
non-degenerate one-parameter exponential family,

$$f(x;\theta) = \exp\{\eta(\theta)T(x) - d(\theta) + S(x)\}, \qquad x \in \mathcal{X},\ \theta \in \Theta.$$

Assume that:
1. The parameter space $\Theta \subseteq \mathbb{R}$ is an open set.
2. The function $\eta(\cdot)$ is a twice continuously differentiable bijection between $\Theta$
   and $\Phi = \eta(\Theta)$.
Let $\hat{\theta}_n$ be the maximum likelihood estimator of $\theta$, and

$$\hat{J}_n = n J(\hat{\theta}_n) = n\,\frac{d''(\hat{\theta}_n)\eta'(\hat{\theta}_n) - d'(\hat{\theta}_n)\eta''(\hat{\theta}_n)}{\eta'(\hat{\theta}_n)}.$$

Define

$$g(X_1,\dots,X_n,\theta) = \hat{J}_n^{1/2}(\hat{\theta}_n - \theta).$$

Then

$$g(X_1,\dots,X_n,\theta) \xrightarrow{d} N(0,1),$$

and so $g(X_1,\dots,X_n,\theta)$ is an approximate pivot for $\theta$.


Proof The proof is exactly the same as that of Theorem (4.26, p. 122) only this time,
instead of $\theta_0$, we write $\theta$. □


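Exercise 63 (Wald Approximate Confidence Intervals) Using the same notation
as in Proposition 5.8 above, prove that the following table indeed yields approximate
$(1-\alpha)$-confidence intervals for $\theta$:

Approximate Confidence $1-\alpha$    $L(X_1,\dots,X_n)$                                 $U(X_1,\dots,X_n)$
Two-sided                            $\hat{\theta}_n - z_{1-\alpha/2}\,\hat{J}_n^{-1/2}$  $\hat{\theta}_n + z_{1-\alpha/2}\,\hat{J}_n^{-1/2}$
Left-sided                           $\hat{\theta}_n - z_{1-\alpha}\,\hat{J}_n^{-1/2}$    $+\infty$
Right-sided                          $-\infty$                                          $\hat{\theta}_n + z_{1-\alpha}\,\hat{J}_n^{-1/2}$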
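
As a concrete instance of the two-sided row of this table, here is a hedged sketch for a Poisson sample, written in the exponential-family form of Proposition 5.8 with $\eta(\theta) = \log\theta$ and $d(\theta) = \theta$ (so that $J(\theta) = 1/\theta$ and the MLE is the sample mean). The function name `wald_interval`, its argument names and the data are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(theta_hat, n, d1, d2, eta1, eta2, alpha):
    """Two-sided approximate Wald interval theta_hat +/- z_{1-alpha/2} * J_hat^{-1/2}.

    d1, d2 and eta1, eta2 are the first and second derivatives of d and eta.
    """
    info = (d2(theta_hat) * eta1(theta_hat) - d1(theta_hat) * eta2(theta_hat)) / eta1(theta_hat)
    j_hat = n * info                      # J_hat_n = n * J(theta_hat)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z / sqrt(j_hat)
    return theta_hat - half_width, theta_hat + half_width


# Poisson(theta): eta(theta) = log(theta), d(theta) = theta, MLE = sample mean.
counts = [3, 5, 4, 2, 6, 4, 3, 5, 4, 4]   # made-up data for illustration
theta_hat = sum(counts) / len(counts)
lower, upper = wald_interval(
    theta_hat, len(counts),
    d1=lambda t: 1.0, d2=lambda t: 0.0,
    eta1=lambda t: 1.0 / t, eta2=lambda t: -1.0 / t ** 2,
    alpha=0.05,
)
print(lower, upper)
```

In this Poisson case the half-width reduces to $z_{1-\alpha/2}\sqrt{\hat{\theta}_n/n}$, which gives a quick sanity check on the general formula.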


5.2.1.2 Likelihood Ratio Pivots 


Proposition 5.9 (LRT Approximate Pivots) Let $X_1,\dots,X_n$ be an iid sample
from a distribution with density (or mass function) $f(x;\theta)$ which belongs to a
non-degenerate one-parameter exponential family,

$$f(x;\theta) = \exp\{\eta(\theta)T(x) - d(\theta) + S(x)\}, \qquad x \in \mathcal{X},\ \theta \in \Theta.$$

Assume that:

1. The parameter space $\Theta \subseteq \mathbb{R}$ is an open set.

2. The function $\eta(\cdot)$ is a twice continuously differentiable bijection between $\Theta$
   and $\Phi = \eta(\Theta)$.

Let $\hat{\theta}_n$ be the maximum likelihood estimator of $\theta$, and

$$g(X_1,\dots,X_n,\theta) = 2\big(\ell(\hat{\theta}_n) - \ell(\theta)\big).$$

Then,

$$g(X_1,\dots,X_n,\theta) \xrightarrow{d} \chi^2_1,$$

and so $g(X_1,\dots,X_n,\theta)$ is an approximate pivot for $\theta$.


Proof The proof is exactly the same as that of Theorem (4.23, p. 120) only this time,
instead of $\theta_0$, we write $\theta$. □


Notice that the likelihood ratio approximate pivot $g(X_1,\dots,X_n,\theta) = 2\big(\ell(\hat{\theta}_n) - \ell(\theta)\big)$
is not necessarily of a form that we are able to manipulate in order to get the
explicit form of the approximate confidence interval. However, we may numerically
find the approximate confidence interval of interest, by determining the set

$$\{\theta \in \Theta : g(X_1,\dots,X_n,\theta) \le q_{1-\alpha}(\chi^2_1)\},$$

where $q_{1-\alpha}(\chi^2_1)$ is the $(1-\alpha)$-quantile of the $\chi^2_1$ distribution.
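
One possible way to carry out this numerical determination is root-finding on each side of the MLE. The sketch below assumes a Poisson model (for which $\ell(\theta) = \sum_i x_i \log\theta - n\theta$ up to a constant, and the pivot decreases to 0 at $\hat{\theta}_n$ and increases afterwards); the function names, the bracketing endpoints and the data summary are hypothetical. It also uses the fact that $\chi^2_1$ is the law of $Z^2$ for $Z \sim N(0,1)$, so that $q_{1-\alpha}(\chi^2_1) = z_{1-\alpha/2}^2$.

```python
from math import log
from statistics import NormalDist


def lr_pivot(total, n, theta):
    """g = 2 * (loglik(theta_hat) - loglik(theta)) for an iid Poisson(theta) sample."""
    theta_hat = total / n

    def loglik(t):
        return total * log(t) - n * t     # additive constant dropped

    return 2.0 * (loglik(theta_hat) - loglik(theta))


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lr_interval(total, n, alpha):
    """Endpoints of {theta : g(theta) <= q}, found on each side of the MLE."""
    q = NormalDist().inv_cdf(1 - alpha / 2) ** 2      # (1 - alpha)-quantile of chi^2_1

    def excess(t):
        return lr_pivot(total, n, t) - q

    theta_hat = total / n
    lower = bisect(excess, 1e-9, theta_hat)           # hypothetical left bracket
    upper = bisect(excess, theta_hat, 50 * theta_hat) # hypothetical right bracket
    return lower, upper


# Made-up summary: n = 10 observations with total count 40.
print(lr_interval(total=40, n=10, alpha=0.05))
```

Unlike the Wald interval, the resulting interval is generally not symmetric around $\hat{\theta}_n$, since the log-likelihood is not symmetric around its maximum.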


Exercise 64 (Exact and Approximate Pivots)

1. Let $X_1,\dots,X_n \overset{iid}{\sim} F(x;\theta)$ and $T_n(X_1,\dots,X_n)$ be a sufficient statistic that is a
   continuous random variable. Let $Y_n = F_{T_n}(T_n;\theta)$, where $F_{T_n}(t;\theta) = P_\theta[T_n \le t]$
   is the sampling distribution function of $T_n$. Show that $Y_n \sim U(0,1)$ and thus $Y_n$
   is a pivot.

2. How can you use this result to construct a confidence interval for $\theta$, in the case
   where $F_{T_n}$ is known exactly?

3. Assume that $f(x;\theta) = e^{-(x-\theta)}\,\mathbf{1}\{x \in [\theta,\infty)\}$ (not an exponential family). Use
   part (1) and the statistic $T_n = \min\{X_1,\dots,X_n\}$ to find a confidence interval for
   $\theta$ at confidence level $1-\alpha$.


5.3 The Duality with Hypothesis Tests


The careful reader may have become suspicious, while going through the previous
paragraphs, that there is a structural connection lurking between confidence intervals
and hypothesis tests. Here are some clues that one might have picked up along the
way:

• In interval estimation, we try to find a region that will contain the parameter.
  In hypothesis testing, we are given a region and asked whether it contains the
  parameter. It seems like the two problems are dual to each other.

• In hypothesis testing we have the level (the probability of falsely rejecting $H_0$),
  which is $\alpha$. In interval estimation we have the confidence level $1-\alpha$ (the
  probability that the interval covers the true parameter). Is there a relationship
  between the two?

• In hypothesis testing, we constructed likelihood ratio tests and Wald tests for
  the parameter. In interval estimation, we constructed Wald and likelihood ratio
  intervals for the parameter.

Could it be that we are looking at the two sides of the same coin? This is indeed the
case, and it is now time to make the connection rigorous.


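Theorem 5.10 (Duality Theorem) Let $X_1,\dots,X_n \overset{iid}{\sim} f(x;\theta)$ be a random
sample and $\theta \in \Theta \subseteq \mathbb{R}$.
1. If $[L(X_1,\dots,X_n), U(X_1,\dots,X_n)]$ is a two-sided $(1-\alpha)$-confidence interval
   for $\theta$, then the test with test function

$$\delta(X_1,\dots,X_n) = \mathbf{1}\big\{\theta_0 \notin [L(X_1,\dots,X_n), U(X_1,\dots,X_n)]\big\}$$

   is a level $\alpha$ test of $\{H_0 : \theta = \theta_0\}$ against $\{H_1 : \theta \ne \theta_0\}$.

2. Conversely, suppose that given any $\theta_0 \in \Theta$, $\delta(X_1,\dots,X_n;\theta_0)$ is a test
   function for the hypothesis pair $\{H_0 : \theta = \theta_0\}$ against $\{H_1 : \theta \ne \theta_0\}$ with
   probability of type I error $\alpha$. Then,

$$R(X_1,\dots,X_n) = \{\theta \in \Theta : \delta(X_1,\dots,X_n;\theta) = 0\}$$

   is a $(1-\alpha)$-confidence region for $\theta$.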


Proof of Theorem 5.10 We first prove part (1). It suffices to show that the level of
the test $\delta$ is $\alpha$. But observe that

$$P_{\theta_0}[\delta(X_1,\dots,X_n) = 1] = 1 - P_{\theta_0}[\delta(X_1,\dots,X_n) = 0]$$
$$= 1 - P_{\theta_0}[L(X_1,\dots,X_n) \le \theta_0 \le U(X_1,\dots,X_n)]$$
$$\le 1 - \inf_{\theta\in\Theta} P_\theta[L(X_1,\dots,X_n) \le \theta \le U(X_1,\dots,X_n)]$$
$$\le 1 - (1-\alpha)$$
$$= \alpha,$$

so that the test is indeed a level $\alpha$ test. This proves part (1). Now we turn to part (2).
We need to show that $R(X_1,\dots,X_n)$ is a $(1-\alpha)$-confidence region. Let us calculate

$$P_\theta[R(X_1,\dots,X_n) \ni \theta] = P_\theta[\delta(X_1,\dots,X_n;\theta) = 0]$$
$$= 1 - P_\theta[\delta(X_1,\dots,X_n;\theta) = 1]$$
$$= 1-\alpha,$$

where the last line follows from our assumption that $\delta$ has probability of type I error
$\alpha$, for all simple nulls. This proves (2) and completes the proof. □

Remark 5.11 When we follow the process described in part (2) of Theorem 5.10
to get a region $R$ from a test function $\delta$, we speak of inverting a test.


Remark 5.12 Notice that in part (2) we say that $R(X_1,\dots,X_n)$ is a region
and not an interval. The reason is that, depending on the exact form of $\delta$ and the
model $f(x;\theta)$, the set $R(X_1,\dots,X_n)$ may be a union of intervals, or perhaps even
a more complicated set. For some forms of $\delta$ and some models $f(x;\theta)$, the region
$R(X_1,\dots,X_n)$ is indeed an interval. It is not hard to check that likelihood ratio
tests and Wald tests for one-parameter exponential families do indeed yield a region
$R(X_1,\dots,X_n)$ that is an interval.
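
As a small worked illustration of inverting a test, consider the setting of Example 5.3 ($N(\mu,\sigma^2)$ with $\sigma^2$ known). The test function written below, which rejects when the standardised sample mean is large in absolute value, is assumed here for the sake of the sketch rather than quoted from an earlier chapter; its type I error is $\alpha$ because the standardised mean is $N(0,1)$ under $H_0$. Collecting the values $\mu$ that are not rejected recovers the interval (5.1):

```latex
\delta(X_1,\dots,X_n;\mu_0)
  = \mathbf{1}\left\{ \left| \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}} \right| > z_{1-\alpha/2} \right\},
\qquad
P_{\mu_0}\big[\delta(X_1,\dots,X_n;\mu_0) = 1\big] = \alpha .

R(X_1,\dots,X_n)
  = \{\mu \in \mathbb{R} : \delta(X_1,\dots,X_n;\mu) = 0\}
  = \left\{ \mu : \left| \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \right| \le z_{1-\alpha/2} \right\}
  = \left[ \bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\;
           \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} \right].
```

Here the region is a single interval because the standardised mean is monotone in $\mu$.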


Example 5.13 (Mean of a Gaussian) 


Compare the form of the test in Example (4.22, p. 116) with the form of the two-sided confidence 
interval in Exercise (59, p. 135) and conclude that the test and the interval are dual to each other. 


Example 5.14 (Wald Tests and Intervals) 


Compare the form of the approximate Wald test in the example following Theorem (4.26, p. 122) 
with the form of the two-sided approximate Wald interval in Exercise (63, p. 140). 


Example 5.15 (Likelihood Ratio Tests and Intervals) 


Compare the form of the approximate likelihood ratio test in the example following Theorem (4.23, 
p. 120) with the form of the two-sided approximate likelihood ratio interval discussed after the 
proof of Proposition (5.9, p. 140). 


Note that in Theorem 5.10, we only considered two-sided intervals and tests.
What about unilateral intervals and tests? For unilateral results, one direction is
very easy: if $(-\infty, U]$ is a right-sided $(1-\alpha)$-confidence interval for $\theta$, then
$\delta = \mathbf{1}\{U < \theta_0\}$ is a level $\alpha$ test for $\{H_0 : \theta \ge \theta_0\}$ vs $\{H_1 : \theta < \theta_0\}$ (and symmetrically
for left-sided intervals).³ So it's still easy to get unilateral hypothesis tests from
unilateral confidence intervals. The opposite direction is more complicated. Getting
a unilateral interval from a unilateral test depends on the form of the test function
and on the form of the model under consideration.⁴ Below we give a case where it's
possible.


³ The proof of this is analogous to the first part of Theorem 5.10.

⁴ The problem is that, as we saw in Theorem 5.10, we have no guarantee in general that the region
we get from inverting a test will be an interval, much less so a “one-sided” interval, unless there
are further conditions.

Proposition 5.16 (One-Sided Intervals from Unilateral Tests) Let $X_1,\dots,X_n$
be an iid sample from a one-parameter exponential family with density (or
frequency)

$$f(x;\theta) = \exp\{\eta(\theta)T(x) - d(\theta) + S(x)\}, \qquad x \in \mathcal{X},\ \theta \in \Theta \subseteq \mathbb{R},$$

such that $\eta(\cdot)$ is strictly increasing and continuously differentiable, and $\Theta$ is
open. Assume that $\tau = \sum_{i=1}^{n} T(X_i)$ is a continuous random variable, with
distribution function $P_\theta[\tau \le t] = G(t;\theta)$.
1. Let $\delta(X_1,\dots,X_n;\theta_0)$ be the UMP test of

$$H_0 : \theta \le \theta_0 \quad\text{vs}\quad H_1 : \theta > \theta_0$$

   at level $\alpha$, as defined in Theorem (4.16, p. 109). Then, the region

$$R(X_1,\dots,X_n) = \{\theta \in \Theta : \delta(X_1,\dots,X_n;\theta) = 0\}$$

   is a $(1-\alpha)$ left-sided interval of the form $[L(X_1,\dots,X_n), +\infty)$.
2. Let $\delta(X_1,\dots,X_n;\theta_0)$ be the UMP test of

$$H_0 : \theta \ge \theta_0 \quad\text{vs}\quad H_1 : \theta < \theta_0$$

   at level $\alpha$, as defined in Theorem (4.16, p. 109). Then, the region

$$R(X_1,\dots,X_n) = \{\theta \in \Theta : \delta(X_1,\dots,X_n;\theta) = 0\}$$

   is a $(1-\alpha)$ right-sided interval of the form $(-\infty, U(X_1,\dots,X_n)]$.


Proof We will prove only part (1), as (2) will then follow by symmetric arguments.
The form of the test function $\delta(X_1,\dots,X_n;\theta)$ is given by Theorem (4.16, p. 109)
to be

$$\delta(X_1,\dots,X_n;\theta) = \mathbf{1}\{\tau(X_1,\dots,X_n) > q_{1-\alpha}(\theta)\} = \mathbf{1}\{\tau(X_1,\dots,X_n) \ge q_{1-\alpha}(\theta)\},$$

where $q_{1-\alpha}(\theta)$ is the $(1-\alpha)$-quantile of $G(t;\theta)$ (the two indicators agree with probability
one, since $\tau$ is a continuous random variable). It follows that

$$R(X_1,\dots,X_n) = \{\theta \in \Theta : \tau(X_1,\dots,X_n) \le q_{1-\alpha}(\theta)\}$$
$$= \{\theta \in \Theta : G(\tau(X_1,\dots,X_n);\theta) \le G(q_{1-\alpha}(\theta);\theta)\}$$
$$= \{\theta \in \Theta : G(\tau(X_1,\dots,X_n);\theta) \le 1-\alpha\}$$
$$= \{\theta \in \Theta : 1 - G(\tau(X_1,\dots,X_n);\theta) \ge \alpha\},$$

where the second equality follows since $G(t;\theta)$ is non-decreasing in $t$ for all $\theta$.
If we can show that $G(t;\theta)$ is continuous with respect to $\theta$, then the region $R$
will necessarily be a union of intervals. If we can also show that $1 - G(t;\theta) =
P_\theta[\tau(X_1,\dots,X_n) > t]$ is increasing in $\theta$ for all $t$, then it will be clear that $R$ will in
fact be a single contiguous interval of the form $[L, +\infty)$, for some random variable
$L$. But under our conditions, $P_\theta[\tau(X_1,\dots,X_n) > t]$ has indeed been proven to be
differentiable and increasing in $\theta$ in the first part of the proof of Theorem (4.16,
p. 109).⁵ To complete the proof, we need to show that the confidence level of
$R(X_1,\dots,X_n) = [L(X_1,\dots,X_n), +\infty)$ is indeed $1-\alpha$. This follows easily by
observing that for any $\theta \in \Theta$:

$$P_\theta[L(X_1,\dots,X_n) \le \theta] = P_\theta[R(X_1,\dots,X_n) \ni \theta] = P_\theta[\delta(X_1,\dots,X_n;\theta) = 0]$$
$$= P_\theta[\tau(X_1,\dots,X_n) \le q_{1-\alpha}(\theta)]$$
$$= G(q_{1-\alpha}(\theta);\theta)$$
$$= 1-\alpha.$$

□


In non-technical terms, the proposition says that under some conditions, inverting
a one-sided test in an exponential family will give a one-sided confidence interval.
The details of exactly how this interval is constructed are not the most essential part
here. The important thing is that we have found that the optimal one-sided tests can
be used to yield confidence intervals. Since the tests are optimal, should the intervals
not be optimal too? But what do we mean by an optimal confidence interval? We
will consider these questions in the next paragraph.
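
For readers who do want to see the construction carried out once, here is a hedged numerical sketch. It assumes an exponential model with mean $\theta$, that is $f(x;\theta) = \exp\{-x/\theta - \log\theta\}$ for $x > 0$, so that $\eta(\theta) = -1/\theta$ is strictly increasing and $\tau = \sum_i X_i$ has a Gamma distribution with integer shape $n$. The model, the function names, the bracketing values and the numbers are all illustrative choices, not taken from the text. The lower limit $L$ is found as the solution of $1 - G(\tau; L) = \alpha$, which is the boundary of the region $\{\theta : 1 - G(\tau;\theta) \ge \alpha\}$ from the proof.

```python
from math import exp


def survival(t, n, theta):
    """1 - G(t; theta) = P_theta[tau > t] for tau = X_1 + ... + X_n, X_i ~ Exp(mean theta).

    tau is Gamma with integer shape n, so the survival function is a Poisson sum:
    sum over k < n of exp(-t/theta) * (t/theta)^k / k!.
    """
    lam = t / theta
    term, total = exp(-lam), 0.0
    for k in range(n):
        total += term
        term *= lam / (k + 1)
    return total


def left_sided_lower_limit(tau, n, alpha, tol=1e-10):
    """Solve 1 - G(tau; L) = alpha for L by bisection; the region is [L, +infinity).

    Assumes alpha is small enough that survival(tau, n, tau / n) > alpha.
    """
    lo, hi = 1e-6 * tau / n, tau / n      # survival is increasing in theta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if survival(tau, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Made-up summary: n = 10 observations summing to 20.
L = left_sided_lower_limit(tau=20.0, n=10, alpha=0.05)
print(L, survival(20.0, 10, L))
```

The bisection relies on exactly the monotonicity in $\theta$ that the proof establishes: without it, the set of non-rejected values need not be a half-line and a single root would not describe it.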


5.4 Optimality in Interval Estimation 


When discussing hypothesis tests, we saw that there are cases (depending on the 
hypothesis pair structure) where there was an optimal test function that one should 
use. It is therefore natural to wonder whether there are also cases in interval 
estimation, where there is an optimal confidence interval that one should use. How 
should one define optimal, though? It seems that any definition of optimality should 
satisfy the following two criteria: 
1. Intuitively, optimal confidence intervals should be as “short” as possible on
   average, subject to being able to respect their confidence level: the shorter the
   interval, the more precise our localisation of our parameter.

2. Mathematically, we have seen that there exists a natural duality between
   confidence intervals and hypothesis tests. Therefore, any notion of optimality for
   confidence intervals should be dual to the notion of optimality for hypothesis
   tests. In other words: inverting an optimal hypothesis test should give us an
   optimal confidence interval.


⁵ Recall that in that theorem we proved that the derivative of the mapping $\theta \mapsto
E_\theta[\delta(X_1,\dots,X_n)] = P_\theta[\tau \ge c]$ exists and is positive for all $\theta$ and all $c$.

Since we have seen that in general there can be no optimal test in bilateral 
hypothesis pairs, the second criterion rules out hopes of being able to obtain optimal 
two-sided confidence intervals. What about one-sided intervals, though? It turns out 
that the following definition of optimality for one-sided intervals satisfies both of 
the stated criteria: 


Definition 5.17 (Uniformly Most Accurate One-Sided Intervals)
Let $[L(X_1,\dots,X_n), +\infty)$ and $[M(X_1,\dots,X_n), +\infty)$ be two left-sided $(1-\alpha)$
confidence intervals for $\theta$. If for all $\theta \in \Theta$,

$$P_\theta[\theta - L \ge \epsilon] \le P_\theta[\theta - M \ge \epsilon], \qquad \forall \epsilon > 0,$$

then $[L(X_1,\dots,X_n), +\infty)$ is said to be more accurate than $[M(X_1,\dots,X_n), +\infty)$
at confidence level $1-\alpha$. If $[L, +\infty)$ is more accurate than any other
competing $(1-\alpha)$ left-sided interval, then it is called a uniformly most accurate
(UMA) left-sided interval at confidence level $(1-\alpha)$.

Let $(-\infty, U(X_1,\dots,X_n)]$ and $(-\infty, M(X_1,\dots,X_n)]$ be two right-sided $(1-\alpha)$
confidence intervals for $\theta$. If for all $\theta \in \Theta$,

$$P_\theta[U - \theta \ge \epsilon] \le P_\theta[M - \theta \ge \epsilon], \qquad \forall \epsilon > 0,$$

then $(-\infty, U(X_1,\dots,X_n)]$ is said to be more accurate than $(-\infty, M(X_1,\dots,X_n)]$
at confidence level $1-\alpha$. If $(-\infty, U]$ is more accurate than any other
competing $(1-\alpha)$ right-sided interval, then it is called a uniformly most accurate
(UMA) right-sided interval at confidence level $(1-\alpha)$.


Remark 5.18 (On Interpreting the Optimality of Intervals) Since one-sided
intervals have infinite length, we cannot really make sense of what it means to have
a “shortest” interval. Therefore, we define a one-sided interval to be most accurate
if the bound it provides is less likely to be at a distance larger than $\epsilon > 0$ from the
true parameter than any other competing interval, whatever the true parameter may
be, and whatever $\epsilon > 0$ may be. Loosely speaking, the average tightness of a most
accurate interval's bound is higher than the average tightness of any other interval's
bound. Figure 5.1 provides a visual illustration of the concept.


Our definition can be seen to satisfy the requirement of intuitively being equivalent
to “shortness” of the confidence intervals. The next proposition establishes that
it also respects the duality with hypothesis tests (at least within the context of
one-parameter exponential families) in the sense that the inversion of the uniformly most
powerful hypothesis test yields the most accurate confidence interval.

Fig. 5.1 Illustration of the definition of a most accurate left-sided interval. The idea is that, given
$\epsilon > 0$, the optimal interval's lower bound $L(X_1,\dots,X_n)$ is less likely to fall in the shaded region
than the lower bound of any other left-sided interval (in both cases subject to the constraint of
having confidence level $1-\alpha$)


Proposition 5.19 (UMP Tests = UMA Intervals in Exponential Families)
Let $X_1,\dots,X_n$ be an iid sample from a one-parameter exponential family with
density (or frequency)

$$f(x;\theta) = \exp\{\eta(\theta)T(x) - d(\theta) + S(x)\}, \qquad x \in \mathcal{X},\ \theta \in \Theta \subseteq \mathbb{R},$$

such that $\eta(\cdot)$ is strictly increasing and continuously differentiable, and $\Theta$ is
open. Assume that $\tau = \sum_{i=1}^{n} T(X_i)$ is a continuous random variable, with
distribution function $P_\theta[\tau \le t] = G(t;\theta)$.
Given any $\theta_0 \in \Theta$, let $\delta(X_1,\dots,X_n;\theta_0)$ be the UMP test of

$$H_0 : \theta \le \theta_0 \quad\text{vs}\quad H_1 : \theta > \theta_0$$

at level $\alpha$. Then, the region

$$R(X_1,\dots,X_n) = \{\theta \in \Theta : \delta(X_1,\dots,X_n;\theta) = 0\}$$

is a uniformly most accurate left-sided confidence interval at confidence
level $1-\alpha$.


Remark 5.20 Of course, the symmetric version of this theorem holds true for 
right-sided intervals. 


Proof From Proposition 5.16 (p. 144) we know that $R(X_1,\dots,X_n)$ is a confidence
interval of the form $[L(X_1,\dots,X_n), +\infty)$, for some statistic $L$, whose confidence
level is $1-\alpha$. So $R(X_1,\dots,X_n)$ is indeed a left-sided $(1-\alpha)$ confidence interval.
Therefore, it suffices to show that $[L, +\infty)$ is uniformly most accurate. To this aim,
let $[M(X_1,\dots,X_n), +\infty)$ be any other $1-\alpha$ left-sided confidence interval. Define
$\psi(X_1,\dots,X_n;\theta_0) = \mathbf{1}\{M(X_1,\dots,X_n) > \theta_0\}$ to be its dual test, which will have
level $\alpha$ (to see this, follow the same steps as just above, replacing $L$ by $M$). Given a
$\theta_1 \in \Theta$ and an $\epsilon > 0$, define $\theta_0 = \theta_1 - \epsilon$ (so that $\theta_1 > \theta_0$). Since $\delta(X_1,\dots,X_n;\theta_0)$
is UMP, we have:

$$P_{\theta_1}[\delta(X_1,\dots,X_n;\theta_0) = 1] \ge P_{\theta_1}[\psi(X_1,\dots,X_n;\theta_0) = 1]$$
$$\implies P_{\theta_1}[\theta_0 < L(X_1,\dots,X_n)] \ge P_{\theta_1}[\theta_0 < M(X_1,\dots,X_n)]$$
$$\implies P_{\theta_1}[L(X_1,\dots,X_n) \le \theta_0] \le P_{\theta_1}[M(X_1,\dots,X_n) \le \theta_0]$$
$$\implies P_{\theta_1}[\theta_0 \ge L(X_1,\dots,X_n)] \le P_{\theta_1}[\theta_0 \ge M(X_1,\dots,X_n)]$$
$$\implies P_{\theta_1}[\theta_1 - \epsilon \ge L(X_1,\dots,X_n)] \le P_{\theta_1}[\theta_1 - \epsilon \ge M(X_1,\dots,X_n)]$$
$$\implies P_{\theta_1}[\theta_1 - L \ge \epsilon] \le P_{\theta_1}[\theta_1 - M \ge \epsilon].$$

Since $\theta_1 \in \Theta$ and $\epsilon$ were arbitrary, we have established that $[L, +\infty)$ is more
accurate than $[M, +\infty)$. □
Exercise 65 Let $X_1,\dots,X_n \overset{iid}{\sim} N(\mu,\sigma^2)$, where $\sigma^2$ is known. Find the expression
for the UMA left-sided interval for $\mu$ at confidence level $1-\alpha$.


Exercise 66 Let $X_1,\dots,X_n \overset{iid}{\sim} \mathrm{Bern}(p)$. Using the sufficient statistic
$T(X_1,\dots,X_n)$ for $p$, find the UMA left-sided interval for $p$ at confidence level
$1-\alpha$, by inverting the test

$$H_0 : p \le p_0 \quad\text{vs}\quad H_1 : p > p_0.$$

The endpoints of this interval are not as explicit as in the previous exercise.
Unfortunately, one of the conditions of Proposition 5.19 is not satisfied (which
one?). Thus, for most values of $p$, the coverage probability will only approximately
be $1-\alpha$.

Exercise 67 Show that the uniformly most accurate interval in Proposition 5.19
coincides with the interval constructed using the pivot $Y_n = F_{T_n}(\tau)$, as in
Exercise 64, p. 141.


5.5 On Interpreting Confidence Intervals


It is very important to take care when interpreting the meaning of a confidence
interval. Notice that

$$\inf_{\theta\in\Theta} P_\theta\big[L(X_1,\dots,X_n) \le \theta \le U(X_1,\dots,X_n)\big] \ge 1-\alpha$$

is an equivalent statement to

$$\inf_{\theta\in\Theta} P_\theta\big\{\theta \in [L(X_1,\dots,X_n), U(X_1,\dots,X_n)]\big\} \ge 1-\alpha.$$

Fig. 5.2 Visualising the notion of a confidence interval at confidence level $1-\alpha$. The vertical
line represents the location of the fixed parameter value on the real axis. The parallel black lines
represent realisations of the random interval $[L, U]$ for $r = 24$ different random samples from
$f(x;\theta)$. We can see that most of them cover $\theta$ but some of them fail to do so. By the law of large
numbers, we expect that as the number of replications $r \to \infty$, the proportion of intervals not
covering $\theta$ will gradually converge to a number no larger than $\alpha$


Though mathematically these statements are equivalent, the second way of writing
the statement may lead to a misinterpretation of what a confidence interval means.

Specifically, it is the interval $[L, U]$ that is random and not the parameter $\theta$. So
saying that “the probability that the parameter falls inside the interval is at least $1-\alpha$”
is wrong: the parameter is not going or falling anywhere, it is fixed. It is the interval
that may change for different samples $X_1,\dots,X_n$, and may or may not cover the
parameter (see Fig. 5.2). Therefore, one should say “the probability that the interval
covers the parameter $\theta$ is at least $(1-\alpha)$”.

A different way of clarifying this is by noticing that:

$$P_\theta\big[L(X_1,\dots,X_n) \le \theta \le U(X_1,\dots,X_n)\big] = P_\theta\big[\{L(X_1,\dots,X_n) \le \theta\} \cap \{U(X_1,\dots,X_n) \ge \theta\}\big],$$

where the right-hand side emphasises that the probability statement applies to the
random confidence limits $L$ and $U$, rather than to the deterministic parameter $\theta$. To
make sure that we avoid confusion, it is better to write $P_\theta\{[L, U] \ni \theta\}$ instead of
$P_\theta\{\theta \in [L, U]\}$.

Concluding this section, we give two exercises which show how the notion of 
confidence intervals (and dual tests) admits a much more general interpretation 
when considering vectors of several parameters. 

Exercise 68 Let $X_1,\ldots,X_n$ be random vectors in $\mathbb{R}^2$, defined as $X_i = (X_{i1}, X_{i2})^\top$,
where $X_{11},\ldots,X_{n1} \stackrel{iid}{\sim} N(\mu_1,\sigma^2)$, $X_{12},\ldots,X_{n2} \stackrel{iid}{\sim} N(\mu_2,\sigma^2)$, and
the $\{X_{i1}\}_{i=1}^n$ are independent of the $\{X_{i2}\}_{i=1}^n$. Assuming that $\sigma$ is known, we wish
to construct a confidence region for the parameter vector $\mu = (\mu_1,\mu_2)^\top$, that is, a
random subset $C(X_1,\ldots,X_n)$ of $\mathbb{R}^2$ satisfying

\[ \mathbb{P}_\mu\big[\mu \in C(X_1,\ldots,X_n)\big] \ge 1-\alpha, \qquad \forall\, \mu \in \mathbb{R}^2, \]

for a certain given confidence level $1-\alpha$, $\alpha \in (0,1)$.

1. Consider confidence regions for $\mu = (\mu_1,\mu_2)^\top$ of the form:

\[ C_1(X_1,\ldots,X_n) = \Big\{ \mu \in \mathbb{R}^2 : \bar{X}_1 - z_{1-\alpha'/2}\frac{\sigma}{\sqrt{n}} \le \mu_1 \le \bar{X}_1 + z_{1-\alpha'/2}\frac{\sigma}{\sqrt{n}}, \;\; \bar{X}_2 - z_{1-\alpha'/2}\frac{\sigma}{\sqrt{n}} \le \mu_2 \le \bar{X}_2 + z_{1-\alpha'/2}\frac{\sigma}{\sqrt{n}} \Big\}. \]

Find the value of $\alpha'$ for which $C_1(X_1,\ldots,X_n)$ is a confidence region with
confidence level $1-\alpha$.

2. Consider now confidence regions for $\mu$ of a different form, namely:

\[ C_2(X_1,\ldots,X_n) = \big\{ \mu \in \mathbb{R}^2 : (\bar{X}_1 - \mu_1)^2 + (\bar{X}_2 - \mu_2)^2 \le Q \big\}. \]

Find the value of $Q$ for which $C_2(X_1,\ldots,X_n)$ is a confidence region for $\mu$ with
confidence level $1-\alpha$.

3. Let $\bar{X}_1 = -0.7$, $\bar{X}_2 = 0.6$, $n = 9$, $\sigma^2 = 1$. Draw the regions $C_1$ and $C_2$ at
confidence level 95% on the plane $\mathbb{R}^2$. Find the ratio of the areas of the two
regions. Which is preferable?


Exercise 69 In the same notation and under the same assumptions as in the
previous exercise, construct two test functions to test $H_0 : \mu = 0$ vs $H_1 : \mu \ne 0$ at
level $\alpha \in (0,1)$, by inverting the previous regions.

A.1 Probability Factsheet


This section provides a snapshot of some of the main probabilistic concepts and 
properties that are made use of in the text. For a more detailed coverage, the reader 
is referred to Knight [14], Chaps. 1-3. 


Events 


A random experiment is a process whose outcome is uncertain. The possible 
outcomes, and combinations thereof, are described in the language of set theory. 
In principle, any statement that makes reference to the outcome of a random 
experiment should be expressible via this language. In detail: 


A possible outcome $\omega$ of a random experiment is called an elementary event.
The set of all possible outcomes, say $\Omega$, is assumed non-empty, $\Omega \ne \emptyset$.

An event is a subset $F \subseteq \Omega$ of $\Omega$. An event $F$ “is realised” (or “occurs”)
whenever the outcome of the experiment is an element of $F$.

The union of two events $F_1$ and $F_2$, written $F_1 \cup F_2$, occurs if and only if either
of $F_1$ or $F_2$ occurs. Equivalently, $\omega \in F_1 \cup F_2$ if and only if $\omega \in F_1$ or $\omega \in F_2$,

\[ F_1 \cup F_2 = \{\omega \in \Omega : \omega \in F_1 \text{ or } \omega \in F_2\}. \]

The intersection of two events $F_1$ and $F_2$, written $F_1 \cap F_2$, occurs if and only if
both $F_1$ and $F_2$ occur. Equivalently, $\omega \in F_1 \cap F_2$ if and only if $\omega \in F_1$ and
$\omega \in F_2$,

\[ F_1 \cap F_2 = \{\omega \in \Omega : \omega \in F_1 \text{ and } \omega \in F_2\}. \]

Unions and intersections of several events, $F_1 \cup \ldots \cup F_n$ and $F_1 \cap \ldots \cap F_n$, are
defined iteratively from the definition for unions and intersections of pairs.

The complement of an event $F$, denoted $F^c$, contains all the elements of $\Omega$ that
are not contained in $F$,

\[ F^c = \{\omega \in \Omega : \omega \notin F\}. \]

Two events $F_1$ and $F_2$ are called disjoint if they contain no common elements,
that is $F_1 \cap F_2 = \emptyset$.

A partition $\{F_n\}_{n \ge 1}$ of $\Omega$ is a collection of events such that $F_i \cap F_j = \emptyset$ for all
$i \ne j$, and $\bigcup_{n \ge 1} F_n = \Omega$.

The difference of two events $F_1$ and $F_2$ is defined as $F_1 \setminus F_2 = F_1 \cap F_2^c$. It contains
all the elements of $F_1$ that are not contained in $F_2$. Notice that the difference is
not symmetric: $F_1 \setminus F_2 \ne F_2 \setminus F_1$.

It can be checked that the following properties hold true:

(i) $(F_1 \cup F_2) \cup F_3 = F_1 \cup (F_2 \cup F_3) = F_1 \cup F_2 \cup F_3$
(ii) $(F_1 \cap F_2) \cap F_3 = F_1 \cap (F_2 \cap F_3) = F_1 \cap F_2 \cap F_3$
(iii) $F_1 \cap (F_2 \cup F_3) = (F_1 \cap F_2) \cup (F_1 \cap F_3)$
(iv) $F_1 \cup (F_2 \cap F_3) = (F_1 \cup F_2) \cap (F_1 \cup F_3)$
(v) $(F_1 \cup F_2)^c = F_1^c \cap F_2^c$ and $(F_1 \cap F_2)^c = F_1^c \cup F_2^c$


Probability Axioms 


A probability measure $\mathbb{P}$ is a real function defined over the events of $\Omega$, assigning a
probability to any event. This can be interpreted as a measure of how certain we are
that the event will occur. It is postulated to satisfy the following properties:

1. $\mathbb{P}(F) \ge 0$, for all events $F$.
2. $\mathbb{P}(\Omega) = 1$.
3. If $\{F_n\}_{n \ge 1}$ are disjoint events, and $F = \bigcup_{n \ge 1} F_n$ is an event given by their union,
then

\[ \mathbb{P}(F) = \sum_{n \ge 1} \mathbb{P}(F_n). \]


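The following properties are immediate consequences of the probability axioms:

$\mathbb{P}(F^c) = 1 - \mathbb{P}(F)$.
$\mathbb{P}(F_1 \cap F_2) \le \min\{\mathbb{P}(F_1), \mathbb{P}(F_2)\}$.
$\mathbb{P}(F_1 \cup F_2) = \mathbb{P}(F_1) + \mathbb{P}(F_2) - \mathbb{P}(F_1 \cap F_2)$.

Continuity from below: let $\{F_n\}_{n \ge 1}$ be nested events, such that $F_j \subseteq F_{j+1}$ for
all $j$, and let $F$ be an event given by $F = \bigcup_{n \ge 1} F_n$. Then $\mathbb{P}(F_n) \to \mathbb{P}(F)$ as $n \to \infty$.

Continuity from above: let $\{F_n\}_{n \ge 1}$ be nested events, such that $F_j \supseteq F_{j+1}$ for
all $j$, and let $F$ be an event given by $F = \bigcap_{n \ge 1} F_n$. Then $\mathbb{P}(F_n) \to \mathbb{P}(F)$ as $n \to \infty$.

If $\Omega = \{\omega_1,\ldots,\omega_K\}$, $K < \infty$, is a finite set, then for any event $F \subseteq \Omega$, we have

\[ \mathbb{P}(F) = \sum_{j \,:\, \omega_j \in F} \mathbb{P}(\omega_j). \]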

Conditional Probability and Independence


Suppose we do not know the precise outcome $\omega \in \Omega$ that has occurred, but we are
told that $\omega \in F_2$ for some event $F_2$. If we are asked to now calculate the probability
that $\omega \in F_1$ also, for some other event $F_1$, then we need to calculate the conditional
probability of $F_1$ given $F_2$.

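• For any pair of events $F_1, F_2$ such that $\mathbb{P}(F_2) > 0$, we define the conditional
probability of $F_1$ given $F_2$ to be

\[ \mathbb{P}(F_1 \mid F_2) = \frac{\mathbb{P}(F_1 \cap F_2)}{\mathbb{P}(F_2)}. \]

• Let $G$ be an event and $\{F_n\}_{n \ge 1}$ be a partition of $\Omega$ such that $\mathbb{P}(F_n) > 0$ for all $n$.
We then have:
– Law of total probability:

\[ \mathbb{P}(G) = \sum_{n=1}^{\infty} \mathbb{P}(G \mid F_n)\,\mathbb{P}(F_n) \]

– Bayes’ theorem:

\[ \mathbb{P}(F_j \mid G) = \frac{\mathbb{P}(F_j \cap G)}{\mathbb{P}(G)} = \frac{\mathbb{P}(G \mid F_j)\,\mathbb{P}(F_j)}{\sum_{n=1}^{\infty} \mathbb{P}(G \mid F_n)\,\mathbb{P}(F_n)} \]

• The events $\{G_n\}_{n \ge 1}$ are called independent if and only if for any finite sub-collection
$\{G_{i_1},\ldots,G_{i_K}\}$, $K < \infty$, we have:

\[ \mathbb{P}(G_{i_1} \cap \cdots \cap G_{i_K}) = \mathbb{P}(G_{i_1}) \times \mathbb{P}(G_{i_2}) \times \ldots \times \mathbb{P}(G_{i_K}) \]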
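
To see the law of total probability and Bayes’ theorem at work on numbers, here is a small sketch with a three-set partition. The priors and likelihoods are made up for illustration, and the helper name `bayes_posterior` is ours; exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

def bayes_posterior(prior, likelihood):
    """prior[n] = P(F_n), likelihood[n] = P(G | F_n); returns (P(G), [P(F_n | G)])."""
    # Law of total probability: P(G) = sum_n P(G | F_n) P(F_n).
    p_g = sum(l * p for l, p in zip(likelihood, prior))
    # Bayes' theorem: P(F_n | G) = P(G | F_n) P(F_n) / P(G).
    posterior = [l * p / p_g for l, p in zip(likelihood, prior)]
    return p_g, posterior

prior = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]           # hypothetical P(F_n)
likelihood = [Fraction(1, 100), Fraction(2, 100), Fraction(5, 100)]  # hypothetical P(G | F_n)
p_g, posterior = bayes_posterior(prior, likelihood)
print(p_g, posterior)
```

Although $F_3$ has the smallest prior probability, it ends up with the largest posterior probability, because $G$ is much more likely under $F_3$ than under the other two sets.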


Random Variables and Distribution Functions 


Random variables are, simply stated, numerical summaries of the outcome of a
random experiment. Since the result is random, such numerical summaries are
random, too. They allow us to not worry too much about the precise structure of the
outcome $\omega \in \Omega$, but concentrate on a numerical summary instead. If that numerical
summary is all we really care about, we can concentrate on the range of a random
variable $X$, rather than consider $\Omega$ itself.

• A random variable is a real function $X : \Omega \to \mathbb{R}$.

• We write $\{a \le X \le b\}$ to denote the event

\[ \{\omega \in \Omega : a \le X(\omega) \le b\}. \]

More generally, if $A \subseteq \mathbb{R}$ is a more general subset, we write $\{X \in A\}$ to denote
the event

\[ \{\omega \in \Omega : X(\omega) \in A\}. \]


• If we have a probability measure defined on the events of $\Omega$, then $X$ induces
a new probability measure on subsets of the real line. This is described by the
distribution function (or cumulative distribution function) $F_X : \mathbb{R} \to [0,1]$ of a
random variable $X$ (or the law of $X$). This is defined as

\[ F_X(x) = \mathbb{P}(X \le x). \]

• By its definition, a distribution function satisfies the following properties:
(i) $x \le y \implies F_X(x) \le F_X(y)$
(ii) $\lim_{x \to +\infty} F_X(x) = 1$, $\lim_{x \to -\infty} F_X(x) = 0$
(iii) $\lim_{y \downarrow x} F_X(y) = F_X(x)$, that is, $F_X$ is right-continuous.
(iv) $\lim_{y \uparrow x} F_X(y)$ exists, that is, $F_X$ is left-limited.
(v) $\mathbb{P}(a < X \le b) = F_X(b) - F_X(a)$.
(vi) $\mathbb{P}(X > a) = 1 - F_X(a)$.
(vii) Let $D_X := \{x \in \mathbb{R} : F_X(x) - \lim_{y \uparrow x} F_X(y) > 0\}$ be the set of points
where $F_X$ is not continuous.
– $D_X$ is a countable set (Lemma A.11, p. 169).
– If $\mathbb{P}(\{X \in D_X\}) = 1$, then $X$ is called a discrete random variable
(equivalently, $X$ has a finite or countable range, with probability 1).
– If $D_X = \emptyset$, then $X$ is called a continuous random variable (the
distribution function $F_X$ is continuous).
– It may very well happen that a random variable is neither discrete
nor continuous.


Probability Density and Probability Mass Functions 


• The probability mass function (or frequency function) $f_X : \mathbb{R} \to [0,1]$ of a
discrete random variable $X$ is defined as

\[ f_X(x) = \mathbb{P}(X = x). \]

By its definition, a probability mass function satisfies
(i) $\mathbb{P}(X \in A) = \sum_{t \in A \cap \mathcal{X}} f_X(t)$, for $A \subseteq \mathbb{R}$ and $\mathcal{X} = \{x \in \mathbb{R} : f_X(x) > 0\}$.
(ii) $F_X(x) = \sum_{t \in (-\infty, x] \cap \mathcal{X}} f_X(t)$, for all $x \in \mathbb{R}$ and $\mathcal{X} = \{x \in \mathbb{R} : f_X(x) > 0\}$.
(iii) An immediate corollary is that $F_X(x)$ is piecewise constant with jumps at the
points in $\mathcal{X} = \{x \in \mathbb{R} : f_X(x) > 0\}$.

• A continuous random variable $X$ has probability density function $f_X : \mathbb{R} \to [0, +\infty)$ if

\[ F_X(b) - F_X(a) = \int_a^b f_X(t)\,dt \]

for all real numbers $a < b$. By its definition, a probability density satisfies

(i) $F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$
(ii) $f_X(x) = F_X'(x)$, whenever $f_X$ is continuous at $x$.
(iii) Note that $f_X(x) \ne \mathbb{P}(X = x) = 0$. In fact, it can be that $f_X(x) > 1$ for some $x$. It
can even happen that $f_X$ is unbounded.


Random Vectors and Joint Distributions 


A random vector $X = (X_1,\ldots,X_d)^\top$ is a finite collection of random variables,
arranged as the coordinates of a vector. The point is that we may want to make
probabilistic statements on the joint behaviour of all these random variables. In this
case, we need to define their joint distribution, and respective joint density (or joint
frequency).

• The joint distribution function of a random vector $X = (X_1,\ldots,X_d)^\top$ is defined
as:

\[ F_X(x_1,\ldots,x_d) = \mathbb{P}(X_1 \le x_1,\ldots,X_d \le x_d). \]

• Correspondingly, one defines the
– joint frequency function, if the $\{X_i\}_{i=1}^d$ are all discrete,

\[ f_X(x_1,\ldots,x_d) = \mathbb{P}(X_1 = x_1,\ldots,X_d = x_d). \]

– the joint density function, if there exists $f_X : \mathbb{R}^d \to [0, +\infty)$ such that:

\[ F_X(x_1,\ldots,x_d) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_d} f_X(u_1,\ldots,u_d)\,du_1 \ldots du_d. \]

In this case, when $f_X$ is continuous at the point $x$,

\[ f_X(x_1,\ldots,x_d) = \frac{\partial^d}{\partial x_1 \ldots \partial x_d} F_X(x_1,\ldots,x_d). \]


Marginal Distributions 


Given the joint distribution of the random vector $X = (X_1,\ldots,X_d)^\top$ we can
always isolate the distribution of a single coordinate, say $X_i$.

• In the discrete case, the marginal frequency function of $X_i$ is given by $f_{X_i} : \mathbb{R} \to [0, +\infty)$:

\[ f_{X_i}(x_i) = \sum_{x_1} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_d} f_X(x_1,\ldots,x_{i-1},x_i,x_{i+1},\ldots,x_d). \]

• In the continuous case, the marginal density function of $X_i$ is given by $f_{X_i} : \mathbb{R} \to [0, +\infty)$:

\[ f_{X_i}(x_i) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_X(y_1,\ldots,y_{i-1},x_i,y_{i+1},\ldots,y_d)\,dy_1 \ldots dy_{i-1}\,dy_{i+1} \ldots dy_d. \]

• More generally, we can define the joint frequency/density of a random vector
formed by a subset of the coordinates of $X = (X_1,\ldots,X_d)^\top$, say the first $k$
($k \le d$), $(X_1,\ldots,X_k)^\top$, via

– Discrete case:
\[ f_{X_1,\ldots,X_k}(x_1,\ldots,x_k) = \sum_{x_{k+1}} \cdots \sum_{x_d} f_X(x_1,\ldots,x_k,x_{k+1},\ldots,x_d). \]
– Continuous case:
\[ f_{X_1,\ldots,X_k}(x_1,\ldots,x_k) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_X(x_1,\ldots,x_k,x_{k+1},\ldots,x_d)\,dx_{k+1} \ldots dx_d. \]

• In other words, in order to find a marginal density/frequency of a subset of
random variables, we need to integrate/sum out the remaining variables from
the overall joint density/frequency.

• It is important to note that the marginals do not uniquely determine the joint
distribution.
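
A two-point illustration of the last remark, as a sketch of our own (the two joint frequency functions and the helper `marginal` are chosen for illustration and do not come from the text): both joints below live on $\{0,1\}^2$ and have Bern(1/2) marginals, yet they are different distributions.

```python
from fractions import Fraction

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Two joint frequency functions on {0, 1}^2, keyed by (x1, x2).
independent = {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter}
comonotone = {(0, 0): half, (0, 1): 0, (1, 0): 0, (1, 1): half}  # here X1 = X2

def marginal(joint, coord):
    """Sum the joint frequency over the other coordinate."""
    out = {}
    for point, mass in joint.items():
        out[point[coord]] = out.get(point[coord], 0) + mass
    return out

for coord in (0, 1):
    print(marginal(independent, coord), marginal(comonotone, coord))
```

Summing out a coordinate discards how the coordinates move together, which is why the two tables cannot be told apart from their marginals alone.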


Conditional Distributions 


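Similarly to the notion of conditional probability, we may wish to make probabilistic
statements about the potential outcomes of one random variable, if we already
know the outcome of another. For this we need the notion of conditional density
and conditional frequency functions. If $(X_1,\ldots,X_d)$ is a continuous/discrete
random vector, we define the conditional probability density/frequency function of
$(X_1,\ldots,X_k)$ given $\{X_{k+1} = x_{k+1},\ldots,X_d = x_d\}$ as

\[ f_{X_1,\ldots,X_k \mid X_{k+1},\ldots,X_d}(x_1,\ldots,x_k \mid x_{k+1},\ldots,x_d) = \frac{f_{X_1,\ldots,X_d}(x_1,\ldots,x_k,x_{k+1},\ldots,x_d)}{f_{X_{k+1},\ldots,X_d}(x_{k+1},\ldots,x_d)} \]

provided that $f_{X_{k+1},\ldots,X_d}(x_{k+1},\ldots,x_d) > 0$. The corresponding distribution
functions are:

• In the discrete case:

\[ F_{X_1,\ldots,X_k \mid X_{k+1},\ldots,X_d}(x_1,\ldots,x_k \mid x_{k+1},\ldots,x_d) = \sum_{u_1 \le x_1} \cdots \sum_{u_k \le x_k} f_{X_1,\ldots,X_k \mid X_{k+1},\ldots,X_d}(u_1,\ldots,u_k \mid x_{k+1},\ldots,x_d). \]

• In the continuous case:

\[ F_{X_1,\ldots,X_k \mid X_{k+1},\ldots,X_d}(x_1,\ldots,x_k \mid x_{k+1},\ldots,x_d) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_k} f_{X_1,\ldots,X_k \mid X_{k+1},\ldots,X_d}(u_1,\ldots,u_k \mid x_{k+1},\ldots,x_d)\,du_1 \ldots du_k. \]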

Independent Random Variables 


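The random variables $X_1,\ldots,X_d$ are called independent if and only if, for all
$x_1,\ldots,x_d \in \mathbb{R}$,

\[ F_{X_1,\ldots,X_d}(x_1,\ldots,x_d) = F_{X_1}(x_1) \times \ldots \times F_{X_d}(x_d). \]

Equivalently, $X_1,\ldots,X_d$ are independent if and only if, for all $x_1,\ldots,x_d \in \mathbb{R}$,

\[ f_{X_1,\ldots,X_d}(x_1,\ldots,x_d) = f_{X_1}(x_1) \times \ldots \times f_{X_d}(x_d). \]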


Note that when random variables are independent, conditional distributions reduce 
to the corresponding marginal distributions. Intuitively, knowing the value of one of 
the random variables gives us no information about the distribution of the rest. 


Expectation, Variance, Covariance 


The expectation (or expected value) of a random variable $X$ formalises the notion
of the “average” value taken by that random variable (in a sense, the typical value,
what we expect). It is defined as follows.

– For continuous variables:

\[ \mathbb{E}[X] = \int_{-\infty}^{+\infty} x f_X(x)\,dx. \]

– For discrete variables:

\[ \mathbb{E}[X] = \sum_{x \in \mathcal{X}} x f_X(x), \qquad \mathcal{X} = \{x \in \mathbb{R} : f_X(x) > 0\}. \]

The expectation satisfies the following properties:

• Linearity: $\mathbb{E}[X_1 + \alpha X_2] = \mathbb{E}[X_1] + \alpha \mathbb{E}[X_2]$.

• $\mathbb{E}[h(X)] = \sum_{x \in \mathcal{X}} h(x) f_X(x)$ (discrete case)
or
$\mathbb{E}[h(X)] = \int_{-\infty}^{+\infty} h(x) f_X(x)\,dx$ (continuous case).
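
The variance of a random variable $X$ expresses how dispersed the realisations of $X$
are around its expectation:

\[ \operatorname{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}(X))^2\big] \quad (\text{if } \mathbb{E}[X^2] < \infty). \]

Furthermore, the covariance of a random variable $X_1$ with another random variable
$X_2$ expresses the degree of linear dependency between the two:

\[ \operatorname{Cov}(X_1, X_2) = \mathbb{E}\big[(X_1 - \mathbb{E}(X_1))(X_2 - \mathbb{E}(X_2))\big] \quad (\text{if } \mathbb{E}[X_i^2] < \infty). \]

The correlation between $X_1$ and $X_2$ is defined as

\[ \operatorname{Corr}(X_1, X_2) = \frac{\operatorname{Cov}(X_1, X_2)}{\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}}. \]

It also expresses the degree of linear dependency. Its advantage is that it is invariant
to changes of units of measurement, and moreover can be understood in absolute
terms (it ranges in $[-1, 1]$), as a result of the correlation inequality (itself a
consequence of the Cauchy–Schwarz inequality):

\[ |\operatorname{Cov}(X_1, X_2)| \le \sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}. \]

Some useful formulae relating expectations, variance, and covariances are:
• $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \operatorname{Cov}(X, X)$
• $\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X)$
• $\operatorname{Var}\big(\sum_i X_i\big) = \sum_i \operatorname{Var}(X_i) + \sum_{i \ne j} \operatorname{Cov}(X_i, X_j)$
• $\operatorname{Cov}(X_1, X_2) = \mathbb{E}[X_1 X_2] - \mathbb{E}[X_1]\mathbb{E}[X_2]$
• $\operatorname{Cov}(aX_1 + bX_2, Y) = a\operatorname{Cov}(X_1, Y) + b\operatorname{Cov}(X_2, Y)$
• if $\mathbb{E}[X_1^2] + \mathbb{E}[X_2^2] < \infty$, then the following are equivalent:
(i) $\mathbb{E}[X_1 X_2] = \mathbb{E}[X_1]\mathbb{E}[X_2]$
(ii) $\operatorname{Cov}(X_1, X_2) = 0$
(iii) $\operatorname{Var}(X_1 + X_2) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2)$
Independence will imply these three last properties, but none of these properties
imply independence.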
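
That zero covariance does not imply independence can be verified on a three-point example. The sketch below is our own illustration (the choice of $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$, and the helper `expect`, are assumptions made for the example); it uses exact fractions.

```python
from fractions import Fraction

support = [-1, 0, 1]
p = Fraction(1, 3)  # X is uniform on {-1, 0, 1}; we set Y = X^2

def expect(h):
    """E[h(X)] for the discrete X above."""
    return sum(h(x) * p for x in support)

# Cov(X, Y) = E[XY] - E[X]E[Y] with Y = X^2, so E[XY] = E[X^3].
cov = expect(lambda x: x * x**2) - expect(lambda x: x) * expect(lambda x: x**2)

p_joint = p                     # P(X = 1, Y = 1) = P(X = 1), since X = 1 forces Y = 1
p_product = p * Fraction(2, 3)  # P(X = 1) * P(Y = 1), where P(Y = 1) = 2/3
print(cov, p_joint, p_product)
```

The covariance vanishes because the dependence of $Y$ on $X$ is symmetric rather than linear, while the joint probability fails to factorise.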


A.2 Taylor’s Formula and the Inverse Function Theorem


The following two classic analysis results will often be used. See Rudin [21]
(Chaps. 5 and 9) for their proofs.¹


¹ An elementary proof of the one-dimensional form of the inverse function theorem (which will be
all that will be needed for this text as stated below) can also be found in Corwin and Szczarba [5],
Chap. 9.

Theorem A.1 (Taylor’s Formula with Lagrange Remainder) Let $h(x) : \mathbb{R} \to \mathbb{R}$
be $k$-times continuously differentiable on the closed interval $I$ with endpoints
$x$ and $y$, for some $k \ge 0$. If $h^{(k+1)}$ exists on the interior of $I$, then there exists
$t \in (0, 1)$ such that

\[ h(x) = h(y) + h'(y)(x - y) + \frac{h''(y)}{2!}(x - y)^2 + \ldots + \frac{h^{(k)}(y)}{k!}(x - y)^k + \frac{h^{(k+1)}(\xi)}{(k+1)!}(x - y)^{k+1} \]

for $\xi = tx + (1 - t)y$.


Theorem A.2 (Inverse Function Theorem) Let $h(x) : \mathbb{R} \to \mathbb{R}$ be continuously
differentiable, with a non-zero derivative at a point $x_0 \in \mathbb{R}$. Then, there exists an
$\epsilon > 0$ such that $h^{-1}$ exists and is continuously differentiable on $(h(x_0) - \epsilon, h(x_0) + \epsilon)$, and in fact
$(h^{-1})'(y) = [h'(h^{-1}(y))]^{-1}$ for $|y - h(x_0)| < \epsilon$.


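A.3 Two Concentration Inequalities


Lemma A.3 (Markov’s Inequality) Let $X$ be a non-negative random variable.
Then, given any $\epsilon > 0$,

\[ \mathbb{P}[X \ge \epsilon] \le \frac{\mathbb{E}[X]}{\epsilon}. \]

Proof Notice that $0 \le \epsilon \mathbf{1}\{X \ge \epsilon\} \le X$. Therefore, $\mathbb{E}[\epsilon \mathbf{1}\{X \ge \epsilon\}] \le \mathbb{E}[X]$. But

\[ \mathbb{E}[\epsilon \mathbf{1}\{X \ge \epsilon\}] = \epsilon\, \mathbb{E}[\mathbf{1}\{X \ge \epsilon\}] = \epsilon \big(1 \cdot \mathbb{P}[X \ge \epsilon] + 0 \cdot \mathbb{P}[X < \epsilon]\big) = \epsilon\, \mathbb{P}[X \ge \epsilon]. \]

Combining our findings yields the result.


Lemma A.4 (Chebyshev’s Inequality) Let $X$ be a random variable with finite
mean $\mathbb{E}[X] < \infty$. Then, given any $\epsilon > 0$,

\[ \mathbb{P}\big[|X - \mathbb{E}[X]| \ge \epsilon\big] \le \frac{\operatorname{Var}[X]}{\epsilon^2}. \]

Proof Define $Y = (X - \mathbb{E}[X])^2$ and apply Markov’s inequality to $Y$, with threshold $\epsilon^2$.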
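
Both inequalities are usually far from tight, which a case with closed-form tails makes visible. The sketch below is our own illustration and assumes $X \sim \mathrm{Exp}(1)$, for which $\mathbb{E}[X] = \operatorname{Var}[X] = 1$ and $\mathbb{P}[X \ge \epsilon] = e^{-\epsilon}$; the function names are hypothetical.

```python
import math

def exp1_tail(eps):
    """P[X >= eps] for X ~ Exp(1), eps > 0."""
    return math.exp(-eps)

def exp1_deviation(eps):
    """P[|X - 1| >= eps] for X ~ Exp(1), whose mean and variance both equal 1."""
    upper = math.exp(-(1.0 + eps))                              # P[X >= 1 + eps]
    lower = 1.0 - math.exp(-(1.0 - eps)) if eps < 1.0 else 0.0  # P[X <= 1 - eps]
    return upper + lower

for eps in (0.5, 1.0, 2.0, 4.0):
    markov_bound = 1.0 / eps         # E[X] / eps
    chebyshev_bound = 1.0 / eps**2   # Var[X] / eps^2
    print(eps, exp1_tail(eps), markov_bound, exp1_deviation(eps), chebyshev_bound)
```

The exact probabilities decay exponentially in $\epsilon$, whereas the bounds decay only polynomially; the bounds hold for every distribution with the stated moments, which is the price of their generality.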
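
A.4 Monotonicity and Covariance


Lemma A.5 (Covariance of $X$ and $g(X)$) Let $X$ be a real random variable
with $\mathbb{E}[X^2] < \infty$. Let $g : \mathbb{R} \to \mathbb{R}$ be a non-decreasing function such that
$\mathbb{E}[g^2(X)] < \infty$. Then,

\[ \operatorname{Cov}[X, g(X)] \ge 0. \]

Proof By definition of covariance, writing $\mu = \mathbb{E}[X]$:

\[
\begin{aligned}
\operatorname{Cov}[X, g(X)] &= \mathbb{E}\big\{(X - \mu)\big(g(X) - \mathbb{E}[g(X)]\big)\big\} \\
&= \mathbb{E}\big\{(X - \mu)\big(g(X) - g(\mu) + g(\mu) - \mathbb{E}[g(X)]\big)\big\} \\
&= \mathbb{E}\big\{(X - \mu)\big(g(X) - g(\mu)\big)\big\} + \underbrace{\mathbb{E}\big\{(X - \mu)\big(g(\mu) - \mathbb{E}[g(X)]\big)\big\}}_{=0}.
\end{aligned}
\]

Now $g$ is non-decreasing, so if $X \ge \mu$, then $g(X) \ge g(\mu)$. If $X < \mu$, on the other
hand, then $g(X) \le g(\mu)$ also. Therefore

\[ (X - \mu)\big(g(X) - g(\mu)\big) \ge 0 \]

and the result follows. □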


A.5 Quantiles 


Recall that, for a random variable $X$ taking values in $\mathbb{R}$, we define its distribution
function to be:

\[ F_X : \mathbb{R} \to [0, 1], \qquad F_X(x) = \mathbb{P}[X \le x], \quad x \in \mathbb{R}. \]

Simply put, the distribution function is the answer to the following question: given
a real number $x \in \mathbb{R}$, what is the probability $\mathbb{P}[X \le x]$ that $X$ falls at or below $x$?
We could also ask the opposite question:

Given a probability $\alpha \in (0, 1)$, is there a real number $x$ such that $\mathbb{P}[X \le x] = \alpha$?    (A.1)

This motivates the definition of the so-called quantile function.

Definition A.6 (Quantile Function and Quantiles)
Let $X$ be a random variable and $F_X$ be its distribution function. We define the
quantile function of $X$ to be the function

\[ F_X^- : (0, 1) \to \mathbb{R}, \qquad F_X^-(\alpha) = \inf\{t \in \mathbb{R} : F_X(t) \ge \alpha\}. \]

Given an $\alpha \in (0, 1)$, we call the real number

\[ q_\alpha = F_X^-(\alpha) \]

the $\alpha$-quantile of $X$ (or, equivalently, of $F_X$).


Recall that $F_X$ is always non-decreasing, by its definition. Hence, there are two
possibilities:
(A) $F_X$ is in fact strictly increasing.² Then $F_X$ is also invertible, and we have

\[ F_X^-(\alpha) = F_X^{-1}(\alpha), \qquad \forall\, \alpha \in (0, 1). \]

In this case, our question (A.1) has a unique answer, and the interpretation is
very simple.

(B) $F_X$ is non-decreasing, but not strictly increasing.³ Then there are two things
that may happen:

(B1) There may be multiple real numbers $x$ that satisfy $F_X(x) = \alpha$ (for
example, take $\alpha = 1 - p$ and take $X$ to be a Bern($p$) random variable;
then any $x \in (0, 1)$ satisfies that $F_X(x) = 1 - p = \alpha$). In this case,
$F_X^{-1}(\alpha)$ is a set, not a single real number,

\[ F_X^{-1}(\alpha) = \{x \in \mathbb{R} : F_X(x) = \alpha\}. \]

So, which of these numbers should we pick as the answer to our question
(A.1)? The most mathematically appropriate choice turns out to be the
infimum of this set.⁴ Since $F_X$ is right-continuous (being a probability
distribution function) the infimum of this set equals $F_X^-(\alpha)$.


² This is the case if $X$ is continuous with a density that satisfies $f_X(x) > 0$, $\forall x \in \mathbb{R}$.

³ For regular models, this happens if $X$ is discrete (so $F_X$ is a step-function) or when $X$ is
continuous but there exists at least one open interval $J$ such that $f_X(x) = 0$, $\forall x \in J$.

⁴ This is due to the fact that, with this definition, we have $F(x) \ge \alpha \iff x \ge F^-(\alpha)$, which
is very useful when generating random variables with a prescribed distribution, see Exercise 11
(p. 22).

Fig. A.1 Evaluation of the quantile function for scenarios (A), (B1) and (B2) above. Intuitively, in
order to find $q_\alpha$, we follow the red arrows. (a) Quantile in Scenario (A). (b) Quantile in Scenario
(B1). (c) Quantile in Scenario (B2)


(B2) There may be no real number $x$ such that $F_X(x) = \alpha$ (for example, take
some $\alpha \in (1 - p, 1)$ and take $X$ to be a Bern($p$) random variable). In
this case, our question (A.1) has no answer. So, instead we have to settle
for the first time that $F_X(x)$ “jumps” above $\alpha$, which is again given by
$F_X^-(\alpha)$.

If all of this sounds complicated, Fig. A.1 gives an intuitive illustration that should 
clarify things. 
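
The Bernoulli examples used in scenarios (B1) and (B2) can also be evaluated directly. The sketch below is our own illustration: the function names are hypothetical, and $p = 0.25$ is chosen only because $1 - p$ is then exactly representable in floating point.

```python
def bernoulli_cdf(x, p):
    """F_X(x) = P[X <= x] for X ~ Bern(p)."""
    if x < 0:
        return 0.0
    return 1.0 - p if x < 1 else 1.0

def bernoulli_quantile(alpha, p):
    """inf{t : F_X(t) >= alpha} for alpha in (0, 1).

    F_X is 0 below 0, equals 1 - p on [0, 1) and equals 1 from 1 onwards,
    so the infimum is attained at either 0 or 1.
    """
    return 0 if bernoulli_cdf(0, p) >= alpha else 1

p = 0.25
print(bernoulli_quantile(0.75, p))  # scenario (B1): alpha = 1 - p
print(bernoulli_quantile(0.90, p))  # scenario (B2): alpha in (1 - p, 1)
```

In scenario (B1) every point of $[0, 1)$ solves $F_X(x) = \alpha$ and the infimum picks 0; in scenario (B2) no point solves it and the quantile is the location of the jump, 1.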


Exercise 70 Let $X \sim \mathrm{Exp}(\lambda)$ where $\lambda > 0$. Show that the $\alpha$-quantile of $X$ is
given by

\[ q_\alpha = F_X^-(\alpha) = -\log(1 - \alpha)/\lambda, \]

for $0 < \alpha < 1$.

Exercise 71 (Quantiles Determine Distributions) Let $X$ and $Y$ be random variables
with respective distribution functions $F_X$ and $F_Y$. Suppose that $F_X^-(\alpha) =
F_Y^-(\alpha)$ for all $\alpha \in (0, 1)$. Prove that $F_X = F_Y$.


A.6 Moment Generating Functions 
The moment generating function (MGF) is a useful tool in probability theory that
can often help us to prove independence of random variables or to determine their
moments (hence the word moment generating).


Definition A.7 (Moment Generating Function)
Let $X$ be a random variable taking values in $\mathbb{R}$. The MGF of $X$ is defined as

\[ M_X(t) : \mathbb{R} \to \mathbb{R} \cup \{\infty\}, \qquad M_X(t) = \mathbb{E}\big[e^{tX}\big], \quad t \in \mathbb{R}. \]

Notice that $M_X(0) = 1$ always, so there exists at least one $t \in \mathbb{R}$ for which
$M_X(t) < \infty$. When $M_X(t)$ is finite on an open neighbourhood of zero, then all the
moments of $X$ are defined, and can be determined by evaluating derivatives of $M_X$
at zero.
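
As a quick worked illustration of our own (the exponential distribution is chosen only because everything is available in closed form), let $X \sim \mathrm{Exp}(\lambda)$ with density $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. Then

```latex
\[
M_X(t) = \int_0^{\infty} e^{tx}\,\lambda e^{-\lambda x}\,dx
       = \frac{\lambda}{\lambda - t} \quad (t < \lambda),
\qquad
M_X(t) = \infty \quad (t \ge \lambda),
\]
\[
\frac{d^k}{dt^k} M_X(t) = \frac{k!\,\lambda}{(\lambda - t)^{k+1}} \quad (t < \lambda),
\qquad\text{so that}\qquad
\mathbb{E}[X^k] = \frac{d^k}{dt^k} M_X(t)\Big|_{t=0} = \frac{k!}{\lambda^k}.
\]
```

Here the MGF is finite on the open neighbourhood $(-\infty, \lambda)$ of zero, and the values $k!/\lambda^k$ agree with the moments obtained by integrating $x^k \lambda e^{-\lambda x}$ directly.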


Proposition A.8 (Moments via the MGF) Let $X$ be a random variable taking
values in $\mathbb{R}$, and let $I$ be an open interval such that $M_X(t) < \infty$ for all $t \in I$. It
holds that

1. $\mathbb{E}[|X|^k e^{tX}] < \infty$ for all $k \in \mathbb{N}$ and all $t \in I$.

2. For all $t \in I$, the function $M_X$ is $k$ times differentiable, for all $k \in \mathbb{N}$ (hence
infinitely differentiable on $I$).

3. For all $k \in \mathbb{N}$ and all $t \in I$, $\mathbb{E}[X^k e^{tX}] = \frac{d^k M_X}{dt^k}(t)$.

4. If $\{0\} \subset I$, then $\mathbb{E}[|X|^k] < \infty$ and $\mathbb{E}[X^k] = \frac{d^k M_X}{dt^k}(0)$, for all $k \in \mathbb{N}$.


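Proof We start with part 1. Fix $t_0 \in I$ and $k \in \mathbb{N}$. Since $I$ is open, there exists a
$\delta > 0$ such that $[t_0 - \delta, t_0 + \delta] \subset I$. Since the exponential function is increasing, we
have

\[
\begin{aligned}
|X|^k e^{t_0 X} &= X^k e^{t_0 X} \mathbf{1}\{X \ge 0\} + (-X)^k e^{t_0 X} \mathbf{1}\{X < 0\} \\
&= e^{(t_0 + \delta)X} u_{k,\delta}(X) \mathbf{1}\{X \ge 0\} + e^{(t_0 - \delta)X} u_{k,\delta}(-X) \mathbf{1}\{X < 0\},
\end{aligned}
\]

where $u_{k,\delta} : [0, \infty) \to [0, \infty)$ is given by

\[ u_{k,\delta}(x) = x^k \exp(-\delta x), \qquad k \ge 0, \quad \delta > 0, \quad x \ge 0. \]

It’s not hard to see that $C_{k,\delta} = \sup_{x \ge 0} u_{k,\delta}(x) < \infty$, since the exponential will
decay faster than any polynomial. Specifically,

\[
u_{k,\delta}'(x) = x^{k-1} e^{-\delta x}(k - \delta x)
\begin{cases}
> 0, & x < k/\delta, \\
< 0, & x > k/\delta,
\end{cases}
\]

so that $u_{k,\delta}$ attains its maximum at $x = k/\delta$. We conclude that

\[
\begin{aligned}
\mathbb{E}|X|^k e^{t_0 X} &\le C_{k,\delta}\, \mathbb{E} e^{(t_0 + \delta)X} \mathbf{1}\{X \ge 0\} + C_{k,\delta}\, \mathbb{E} e^{(t_0 - \delta)X} \mathbf{1}\{X < 0\} \\
&\le C_{k,\delta} M_X(t_0 + \delta) + C_{k,\delta} M_X(t_0 - \delta) < \infty.
\end{aligned}
\]

Since the choice of $t_0$ was arbitrary, we have proven part 1.

In order to prove parts 2 and 3, we proceed recursively. Both parts are trivially
valid when $k = 0$. We will now show that if 2 and 3 are valid for $k - 1$ (for all
$t \in I$), then they must be valid for $k$, whenever $k \ge 1$.

Fix $t_0 \in I$. We need to show that

\[
\lim_{t \to t_0} \frac{\mathbb{E}X^{k-1}e^{tX} - \mathbb{E}X^{k-1}e^{t_0 X}}{t - t_0}
= \lim_{t \to t_0} \frac{M_X^{(k-1)}(t) - M_X^{(k-1)}(t_0)}{t - t_0}
= \mathbb{E}X^k e^{t_0 X}. \tag{A.2}
\]

Note that all the expectations in this equation are well defined (finite) as a result of
part 1. Applying Taylor’s formula (Theorem A.1, p. 159) to the function $h_x(t) =
x^{k-1}e^{tx}$ (where $x$ is seen as a fixed constant), we obtain

\[
\frac{X^{k-1}e^{tX} - X^{k-1}e^{t_0 X}}{t - t_0} = \frac{h_X(t) - h_X(t_0)}{t - t_0} = h_X'(\xi) = X^k e^{\xi X}, \qquad |\xi - t_0| < |t - t_0|.
\]

Note that since $\xi$ depends on both $t$ and $X$, it’s in fact a random variable. Similarly,

\[
\frac{X^{k-1}e^{tX} - X^{k-1}e^{t_0 X}}{t - t_0} - X^k e^{t_0 X} = X^k e^{\xi X} - X^k e^{t_0 X} = X^{k+1} e^{\xi' X}(\xi - t_0), \qquad |\xi' - t_0| < |\xi - t_0|.
\]

We must thus show that the expectation of the right-hand side tends to zero as
$t \to t_0$. Since $|\xi - t_0| < |t - t_0|$, it suffices to bound $\mathbb{E}|X|^{k+1}e^{\xi' X}$ uniformly in $t$. Let
$\delta > 0$ be such that $[t_0 - 2\delta, t_0 + 2\delta] \subset I$. Suppose without loss of generality that
$|t - t_0| < \delta$. It follows that $t_0 - \delta < \xi' < t_0 + \delta$ and we can use the same approach
as before to write:

\[
\begin{aligned}
|X|^{k+1} e^{\xi' X} &= X^{k+1} e^{\xi' X} \mathbf{1}\{X \ge 0\} + (-X)^{k+1} e^{\xi' X} \mathbf{1}\{X < 0\} \\
&\le X^{k+1} e^{(t_0 + \delta)X} \mathbf{1}\{X \ge 0\} + (-X)^{k+1} e^{(t_0 - \delta)X} \mathbf{1}\{X < 0\} \\
&= e^{(t_0 + 2\delta)X} u_{k+1,\delta}(X) \mathbf{1}\{X \ge 0\} + e^{(t_0 - 2\delta)X} u_{k+1,\delta}(-X) \mathbf{1}\{X < 0\}.
\end{aligned}
\]

It follows that

\[ \mathbb{E}|X|^{k+1} e^{\xi' X} \le C_{k+1,\delta} M_X(t_0 + 2\delta) + C_{k+1,\delta} M_X(t_0 - 2\delta) < \infty, \]

since $t_0 \pm 2\delta \in I$ and $C_{k+1,\delta} < \infty$. Hence

\[
\left| \mathbb{E}\,\frac{X^{k-1}e^{tX} - X^{k-1}e^{t_0 X}}{t - t_0} - \mathbb{E}X^k e^{t_0 X} \right|
\le C_{k+1,\delta}\big[M_X(t_0 + 2\delta) + M_X(t_0 - 2\delta)\big]\,|t - t_0| \to 0, \qquad t \to t_0.
\]

Consequently, Eq. (A.2) holds true (since the term on the right-hand side of (A.2) is
finite), which translates to

\[ M_X^{(k)}(t_0) = \mathbb{E}X^k e^{t_0 X}, \qquad \forall\, t_0 \in I. \]

The recurrence thus holds true, which establishes 2 and 3. To complete the proof,
observe that when $\{0\} \subset I$, part 4 follows directly from parts 1 and 3. □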


A further important property of the MGF is that, provided that $M_X$ exists on an open
interval containing zero, it uniquely determines the distribution of $X$:


Proposition A.9 (Characterisation Property of the MGF) Let $X$ and $Y$ be
two random variables taking values in $\mathbb{R}$, and let $F_X$ and $F_Y$ be their respective
distributions. Let $M_X, M_Y : \mathbb{R} \to \mathbb{R}$ be their MGFs. If there exists an open
interval $I$ containing zero, such that $M_X(t) < \infty$ and $M_Y(t) < \infty$ for all $t \in I$,
then

\[ F_X = F_Y \iff M_X = M_Y. \]


We will not prove this result in its full generality, as this would require either notions 
related to Laplace transforms, or to characteristic functions (see, e.g., Billingsley 
[2], Sect. 30). We will only give a proof for the special case of non-negative random 
variables (following a saddlepoint argument of Dalang and Conus [8]). This suffices 
to cover the situations where we will use the theorem in this text. 


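Proof of Proposition A.9, Assuming $X, Y \ge 0$ We first consider the case of continuous
random variables, and focus on the random variable $X \ge 0$. Since $X \ge 0$, it
follows that $M_X(t) < \infty$ for all $t \le 0$. Combining this fact with our assumption
means that there exists a $\delta > 0$ such that $M_X(t) < \infty$ for all $t < \delta$. By
Proposition A.8, we now know that the $k$th derivative $M_X^{(k)}(t) = \frac{d^k}{dt^k}M_X(t)$ exists for all $k$ and all $t < \delta$. Our strategy
will be to express $F_X$ as a function of the derivatives of $M_X$. More specifically,
define the function $G_X(t, x) : [0, \infty)^2 \to \mathbb{R}$ as

\[ G_X(t, x) = \sum_{k=0}^{\lfloor tx \rfloor} \frac{t^k}{k!}\, M_X^{(k)}(-t), \]

where $\lfloor z \rfloor$ is the largest integer less than or equal to $z$. We will show that for any
given $x \ge 0$,

\[ \lim_{t \to \infty} G_X(t, x) = F_X(x). \]

Fix $x \ge 0$. Proposition A.8 shows that, for all $k \ge 0$,

\[ M_X^{(k)}(-t) = \mathbb{E}\big[X^k e^{-tX}\big] = \int_0^{\infty} y^k e^{-ty} f_X(y)\,dy, \]

where the last integral is over $[0, \infty)$ by non-negativity of $X$. Thus $G_X$ may be
re-expressed as

\[
G_X(t, x) = \sum_{k=0}^{\lfloor tx \rfloor} \frac{t^k}{k!} \int_0^{+\infty} y^k e^{-ty} f_X(y)\,dy
= \int_0^{+\infty} \underbrace{\left[ \sum_{k=0}^{\lfloor tx \rfloor} \frac{(ty)^k}{k!} e^{-ty} \right]}_{= g_t(x, y)} f_X(y)\,dy,
\]

where $g_t(x, y) = \mathbb{P}[W_{t,y} \le tx]$ for $W_{t,y} \sim \mathrm{Poisson}(ty)$. Consequently, when
$y > x$, Chebyshev’s inequality (Lemma A.4, p. 159) implies that

\[
\begin{aligned}
0 \le g_t(x, y) = \mathbb{P}[W_{t,y} \le tx] &= \mathbb{P}[W_{t,y} - ty \le t(x - y)] \\
&\le \mathbb{P}\big[|W_{t,y} - ty| \ge t(y - x)\big] \\
&\le \frac{\operatorname{Var}[W_{t,y}]}{t^2 (y - x)^2} = \frac{y}{t (y - x)^2}.
\end{aligned}
\]

Similarly, in the case $y < x$, we have

\[
\begin{aligned}
0 \le 1 - g_t(x, y) = \mathbb{P}[W_{t,y} > tx] &= \mathbb{P}[W_{t,y} - ty > t(x - y)] \\
&\le \mathbb{P}\big[|W_{t,y} - ty| \ge t(x - y)\big] \\
&\le \frac{\operatorname{Var}[W_{t,y}]}{t^2 (x - y)^2} = \frac{y}{t (x - y)^2}.
\end{aligned}
\]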
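
Now let $\epsilon > 0$. Choose $h > 0$ sufficiently small so that $F_X(x + h) - F_X(x) < \epsilon/3$
and $F_X(x) - F_X(x - h) < \epsilon/3$ (such a choice is ensured by continuity of $F_X$). Then
choose $t > 0$ sufficiently large so that $t > 6x/(\epsilon h^2)$. We have

\[
\begin{aligned}
G_X(t, x) - F_X(x) &= \int_0^{+\infty} g_t(x, y) f_X(y)\,dy - \int_0^{x} f_X(y)\,dy \\
&= \int_0^{x-h} \big(g_t(x, y) - 1\big) f_X(y)\,dy + \int_{x-h}^{x} \big(g_t(x, y) - 1\big) f_X(y)\,dy \\
&\quad + \int_{x}^{x+h} g_t(x, y) f_X(y)\,dy + \int_{x+h}^{\infty} g_t(x, y) f_X(y)\,dy,
\end{aligned}
\]

so that

\[
\begin{aligned}
|G_X(t, x) - F_X(x)| &\le \int_0^{x-h} |g_t(x, y) - 1| f_X(y)\,dy + \int_{x-h}^{x} |g_t(x, y) - 1| f_X(y)\,dy \\
&\quad + \int_{x}^{x+h} |g_t(x, y)| f_X(y)\,dy + \int_{x+h}^{\infty} |g_t(x, y)| f_X(y)\,dy.
\end{aligned}
\]

Let us consider the terms on the right-hand side one at a time, and bound them
suitably (note that if $x = 0$, we only need to consider the last two integrals). We
have

\[
\int_0^{x-h} |g_t(x, y) - 1| f_X(y)\,dy \le \frac{1}{t} \int_0^{x-h} \frac{y}{(x - y)^2} f_X(y)\,dy
\le \frac{x - h}{t h^2} \int_0^{x-h} f_X(y)\,dy \le \frac{x - h}{t h^2}
\]

by our earlier calculation. Similarly, since $y/(y - x)^2$ is decreasing in $y$ on $(x, \infty)$,

\[
\int_{x+h}^{\infty} |g_t(x, y)| f_X(y)\,dy \le \frac{1}{t} \int_{x+h}^{\infty} \frac{y}{(y - x)^2} f_X(y)\,dy \le \frac{x + h}{t h^2}.
\]

Furthermore, $|g_t(x, y) - 1| \le 1$ and $|g_t(x, y)| \le 1$ for all $x, y \ge 0$, so that

\[ \int_{x-h}^{x} |g_t(x, y) - 1| f_X(y)\,dy \le \int_{x-h}^{x} f_X(y)\,dy = F_X(x) - F_X(x - h) \]

and

\[ \int_{x}^{x+h} |g_t(x, y)| f_X(y)\,dy \le \int_{x}^{x+h} f_X(y)\,dy = F_X(x + h) - F_X(x). \]

In summary, we have shown that for all $t > 6x/(\epsilon h^2)$,

\[
\begin{aligned}
|G_X(t, x) - F_X(x)| &\le \frac{x - h}{t h^2} + [F_X(x) - F_X(x - h)] + [F_X(x + h) - F_X(x)] + \frac{x + h}{t h^2} \\
&= [F_X(x) - F_X(x - h)] + [F_X(x + h) - F_X(x)] + \frac{2x}{t h^2} \\
&< \frac{\epsilon}{3} + \frac{\epsilon}{3} + \frac{\epsilon}{3} = \epsilon.
\end{aligned}
\]

In other words, we have shown that $|G_X(t, x) - F_X(x)| < \epsilon$ for any $\epsilon > 0$ and $t$
sufficiently large, which proves that $\lim_{t \to \infty} G_X(t, x) = F_X(x)$. The exact same
arguments show that $\lim_{t \to \infty} G_Y(t, x) = F_Y(x)$, where $G_Y(t, x)$ is defined in
analogous fashion as $G_X(t, x)$. But $G_X = G_Y$ since $M_X = M_Y$, which proves
that $F_X = F_Y$ and completes the proof in the case when the random variables $X, Y$
are continuous. For the discrete case, we follow the exact same argument, replacing
integrals by sums, and proving that $\lim_{t \to \infty} G_X(t, x) = F_X(x)$ for all continuity
points $x$ of $F_X(x)$. For discontinuity points, we then simply use the right-continuity
of $F_X$. The proof is now complete. □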
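
The convergence $G_X(t, x) \to F_X(x)$ can be observed numerically in a case where every term is explicit. The sketch below is our own illustration and rests on two assumptions: that $G_X(t, x)$ is the truncated series $\sum_{k \le \lfloor tx \rfloor} \frac{t^k}{k!}\mathbb{E}[X^k e^{-tX}]$ appearing in the re-expression of $G_X$ within the proof, and that $X \sim \mathrm{Exp}(\lambda)$, for which $\mathbb{E}[X^k e^{-tX}] = \lambda\, k!/(t + \lambda)^{k+1}$. The function name is hypothetical.

```python
import math

def g_exponential(t, x, lam=1.0):
    """Truncated series sum_{k <= floor(t x)} t^k / k! * E[X^k exp(-t X)] for X ~ Exp(lam).

    Since E[X^k exp(-t X)] = lam * k! / (t + lam)^(k + 1), each term reduces to
    (lam / (t + lam)) * (t / (t + lam))^k.
    """
    ratio = t / (t + lam)
    return sum((lam / (t + lam)) * ratio**k for k in range(math.floor(t * x) + 1))

x = 1.0
target = 1.0 - math.exp(-x)  # F_X(1) for X ~ Exp(1)
for t in (1.0, 10.0, 100.0, 1000.0):
    print(t, g_exponential(t, x), target)
```

In this case the sum is geometric and equals $1 - (t/(t + \lambda))^{\lfloor tx \rfloor + 1}$, which tends to $1 - e^{-\lambda x}$ as $t \to \infty$, in agreement with the limit established above.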


The next lemma is useful when trying to establish the distribution of a sum of
independent random variables.


Lemma A.10 (Sums and MGFs) Let $X$ and $Y$ be two independent random
variables taking values in $\mathbb{R}$, and let $Z = X + Y$. If $M_X(t) < \infty$ and
$M_Y(t) < \infty$ for all $t$ in an open interval $I$, then $M_Z(t) < \infty$ for all $t \in I$
and

\[ M_Z(t) = M_X(t) M_Y(t). \]

Proof By independence, we may write

\[
\infty > M_X(t) M_Y(t) = \mathbb{E}\big[e^{tX}\big]\, \mathbb{E}\big[e^{tY}\big] = \mathbb{E}\big[e^{tX} e^{tY}\big]
= \mathbb{E}\big[\exp\{t(X + Y)\}\big] = M_Z(t), \qquad t \in I.
\]
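
For finitely supported variables the lemma can be checked by brute force. The sketch below is our own illustration: the two frequency functions are arbitrary, the helper names are hypothetical, and the law of $Z = X + Y$ is obtained by discrete convolution, which is where independence is used.

```python
import math
from itertools import product

def mgf(pmf, t):
    """M(t) = E[exp(tX)] for a finitely supported pmf given as {value: probability}."""
    return sum(prob * math.exp(t * value) for value, prob in pmf.items())

def pmf_of_sum(pmf_x, pmf_y):
    """pmf of Z = X + Y when X and Y are independent (discrete convolution)."""
    out = {}
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        out[x + y] = out.get(x + y, 0.0) + px * py
    return out

pmf_x = {0: 0.7, 1: 0.3}          # Bern(0.3)
pmf_y = {0: 0.2, 1: 0.5, 2: 0.3}  # an arbitrary three-point law
pmf_z = pmf_of_sum(pmf_x, pmf_y)

for t in (-1.0, 0.0, 0.5, 2.0):
    print(t, mgf(pmf_z, t), mgf(pmf_x, t) * mgf(pmf_y, t))
```

The two printed columns agree up to floating-point rounding for every $t$, as the lemma predicts.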

A.7 Continuous Mapping and Slutsky’s Theorem


In order to prove these two results, we will first need a couple of results regarding 
distribution functions and their convergence. 


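Lemma A.11 Let $F$ be a cumulative distribution function. Then $F$ has at most
countably many discontinuities.


Proof Let $D_F$ be the set of discontinuity points of $F$. Given any $x \in D_F$, we have

\[ \lim_{\epsilon \downarrow 0} F(x - \epsilon) < \lim_{\epsilon \downarrow 0} F(x + \epsilon) \]

since $F$ is non-decreasing. It follows that there exists a rational number $q(x)$ such
that

\[ \lim_{\epsilon \downarrow 0} F(x - \epsilon) < q(x) < \lim_{\epsilon \downarrow 0} F(x + \epsilon), \qquad \forall\, x \in D_F. \]

Furthermore, whenever $x_1 < x_2$ (so that we may write $x_2 = x_1 + \delta$, for some $\delta > 0$),
the fact that $F$ is non-decreasing implies that

\[ q(x_1) < \lim_{\epsilon \downarrow 0} F(x_1 + \epsilon) \le F(x_1 + \delta/2) = F(x_2 - \delta/2) \le \lim_{\epsilon \downarrow 0} F(x_2 - \epsilon) < q(x_2). \]

Summarising, we have constructed an injection $q : D_F \to \mathbb{Q}$, and thus $D_F$ must be
countable. □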


Lemma A.12_ Given a sequence of random variables X, X,, X2,..., the follow- 
ing two statements are equivalent: 
d 
I. X;, > X. 
2. For all closed subsets C € R, one has 


lim sup P(X, € C) < P(X €C). 


noo 


Proof Assume first that (2) holds true, so that for C; = (—oo, a] and C2 = [a, co), 
we have 


P(X <a) =1-—P(X >a) < 1—limsupP(X, > a) = liminf P(X, < a) 
noo 


noo 


< liminf P(X, < a) < limsup P(X, <a) < P(X <a). 
noo noo 
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If a is a continuity point of the distribution function of X, it must be that P(X < 
d 
a) = P(X <a) and so P(X, <a) > P(X < a). This establishes that X,, > X. 


To prove the converse, assume initially that C = [a,b], where -—co <a <b < 
oo. There exist sequences 0 < ex \, 0,0 < dx \, O such that F(x) = P(X < x) is 
continuous at the points a — 6, and b + e, for all k (Lemma A.11). Consequently, 


lim sup P(X, € C) < limsupP(a — 6, < X, <b +e) = limsupP(X, <b + &) 
n—>oo n=oo noo 
—P(X, <a— 06) = P(X <b+e)—P(X <a —d,) = P(a-& <X <b+e). 


Letting k — oo, continuity from above of probability measures yields 
lim sup P(X, €C) < lim Pa-& <X<b+«e) 
n—>co k—> oo 


=#(P\e-a <x <6+4)] =P(X €C). 


k=1 


If a = —co or b = o&, the statement can be shown to be true by a similar argument. 
Thus (2) is true when C is an interval. 

If C = UC; is the countable union of (potentially infinitely many) closed disjoint 
intervals, the subadditivity of limit superior yields 


CO CO 
lim sup P(X, € C) = lim sup >) P(X, € Ce) < S > lim sup P(X, € Cx) 
k=1 noo 


noo noo k=1 
< > P(X € Cy) = P(X €C). 
k=1 


Suppose now that C = MC;, where each Cy is a disjoint union of countably many 
closed intervals, and C,4; © Cx for all k. Following the same course as in the first 
part of the proof, 


lim sup P(X, € C) < limsup P(X, € Cy) < P(X € Cy) ~ P(X EC), ko. 


noo noo 


To complete the proof, thus, it suffices to show that any closed set C C R can be 
written in this form. 
For every k, divide R into closed intervals of length 2~“, that is i = 2“ [j,j+ 


1]. Let Cy be the union of those intervals ic? } that have a non-empty intersection 
with C: 


C. = 'e . 


fEZ1 CAD 
It is clear that $C_k$ is a countable union of closed intervals, and that $C_k \supseteq C$. If $x \notin C$, there exists an interval $J$ containing $x$ such that $C \cap J = \emptyset$. For $k$ such that $2^{-k} < m(J)/2$ it follows that $x \notin C_k$. We may thus conclude that $C = \bigcap_k C_k$. The fact that $C_k$ is closed follows by a similar reasoning, but we can argue differently: let $x_n \in C_k$ be a sequence converging to $x$. There must exist an $M$ such that the sequence is contained in $C_k \cap [-M, M]$. This last set is closed, as it is the union of finitely many closed intervals. Hence $x \in C_k \cap [-M, M]$ and so $C_k$ is closed.

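It remains to show that $C_{k+1} \subseteq C_k$. Let $x \in C_{k+1}$. There exists $j \in \mathbb{Z}$ such that $x \in I_j^{(k+1)} \subseteq C_{k+1}$, so that $I_j^{(k+1)}$ has a non-empty intersection with $C$. Now, $I_j^{(k+1)} \subset I_{\lfloor j/2 \rfloor}^{(k)}$, and thus this last set also has a non-empty intersection with $C$. It follows that $x \in I_{\lfloor j/2 \rfloor}^{(k)} \subseteq C_k$, and the proof is complete. □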
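The dyadic construction above can be checked on a small example. The following is only an illustrative sketch, not part of the proof: it assumes that the closed set $C$ is a finite set of points (a hypothetical choice made for convenience), represents each interval $I_j^{(k)}$ by its index $j$, and tests the nesting $C_{k+1} \subseteq C_k$ through the parent index $\lfloor j/2 \rfloor$. The function names are ours.

```python
from fractions import Fraction
import math

def dyadic_indices(points, k):
    """Indices j such that I_j^(k) = 2^-k [j, j+1] meets the finite set `points`."""
    idx = set()
    for p in points:
        t = Fraction(p) * 2**k
        j = math.floor(t)
        idx.add(j)
        if t == j:
            # p is a shared endpoint, so it also lies in I_{j-1}^(k)
            idx.add(j - 1)
    return idx

def is_nested(points, k):
    """Check C_{k+1} subset of C_k: each level-(k+1) interval lies in its parent."""
    coarse = dyadic_indices(points, k)
    fine = dyadic_indices(points, k + 1)
    # Python's // is floor division, so j // 2 is the parent index also for j < 0
    return all(j // 2 in coarse for j in fine)
```

Exact rational arithmetic is used so that points lying on interval endpoints are classified without rounding error.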


Proof of the Continuous Mapping Theorem (Theorem 2.25, p. 57) By Lemma A.12, it suffices to prove that $X_n \xrightarrow{d} X$ implies $\limsup_{n\to\infty} P[g(X_n) \in C] \le P[g(X) \in C]$ for all closed $C \subseteq \mathbb{R}$. To this aim, let $C \subseteq \mathbb{R}$ be an arbitrary closed set, let

$$A = \{ x \in \mathbb{R} : g(x) \in C \}$$


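be the inverse image of $C$ via $g$, and let $\bar{A}$ denote the closure of $A$. If $D_g$ is the set of discontinuities of $g$, we may write

$$\bar{A} = \{ \bar{A} \cap D_g \} \cup \{ \bar{A} \cap D_g^c \} \subseteq D_g \cup \{ \bar{A} \cap D_g^c \}.$$

Now if $x \in \bar{A} \cap D_g^c$ then there exists a sequence $\{x_k\} \subset A$ such that $\lim_{k\to\infty} x_k = x$ (by definition of the closure $\bar{A}$). Furthermore, it holds that $g(x) = \lim_{k\to\infty} g(x_k) \in C$, because $x \in D_g^c$ also (so that $g$ is continuous at $x$) and $C$ is closed. Consequently $x \in A$, and we have proven that $\bar{A} \cap D_g^c \subseteq A$.

Summarising, we have

$$\bar{A} \subseteq A \cup D_g. \tag{A.3}$$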
We now exploit this inclusion in order to write

$$P[g(X_n) \in C] = P[X_n \in A] \le P[X_n \in \bar{A}].$$
But,

$$\begin{aligned}
\limsup_{n\to\infty} P[X_n \in \bar{A}] &\le P[X \in \bar{A}] && \text{[using } X_n \xrightarrow{d} X \text{, combined with Lemma A.12]} \\
&\le P[X \in A \cup D_g] && \text{[by (A.3)]} \\
&\le P[X \in A] + \underbrace{P[X \in D_g]}_{=0} \\
&= P[g(X) \in C].
\end{aligned}$$

It follows that $\limsup_{n\to\infty} P[g(X_n) \in C] \le P[g(X) \in C]$ and our proof is complete. □
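A closed-form illustration may help fix ideas. The sketch below assumes a specific model that is not in the text: $X_n \sim N(0, 1 + 1/n)$, $X \sim N(0,1)$ and $g(x) = x^2$, which is continuous so that $D_g$ is empty. In that case $P[g(X_n) \le y] = 2\Phi(\sqrt{y}/\sigma_n) - 1$ with $\sigma_n^2 = 1 + 1/n$, and the gap to the limiting distribution function can be computed directly. The helper names are ours.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cdf_of_square(y, sigma):
    """P[X^2 <= y] for X ~ N(0, sigma^2) and y >= 0."""
    return 2.0 * norm_cdf(math.sqrt(y) / sigma) - 1.0

def gap(y, n):
    """|P[g(X_n) <= y] - P[g(X) <= y]| for g(x) = x^2 and X_n ~ N(0, 1 + 1/n)."""
    sigma_n = math.sqrt(1.0 + 1.0 / n)
    return abs(cdf_of_square(y, sigma_n) - cdf_of_square(y, 1.0))
```

Evaluating `gap(y, n)` for increasing `n` at a fixed `y` shows the distribution function of $g(X_n)$ approaching that of $g(X)$, as the theorem predicts.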


Proof of Slutsky's Theorem (Theorem 2.26, p. 57) For the first part, assume that $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$. We may assume without loss of generality that $c = 0$. Let $x$ be a continuity point of $F_X$. We have

$$P[X_n + Y_n \le x] = P[X_n + Y_n \le x, |Y_n| \le \varepsilon] + P[X_n + Y_n \le x, |Y_n| > \varepsilon]$$
$$\le P[X_n \le x + \varepsilon] + P[|Y_n| > \varepsilon]$$


because $\{X_n + Y_n \le x \ \&\ |Y_n| \le \varepsilon\}$ implies that $\{X_n \le x + \varepsilon\}$. Similarly, we may obtain the inequality

$$P[X_n \le x - \varepsilon] \le P[X_n + Y_n \le x] + P[|Y_n| > \varepsilon].$$


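Rearranging and collecting terms yields:

$$P[X_n \le x - \varepsilon] - P[|Y_n| > \varepsilon] \le P[X_n + Y_n \le x] \le P[X_n \le x + \varepsilon] + P[|Y_n| > \varepsilon]$$
$$\implies \lim_{n\to\infty} P[X_n \le x - \varepsilon] - 0 \le \lim_{n\to\infty} P[X_n + Y_n \le x] \le \lim_{n\to\infty} P[X_n \le x + \varepsilon] + 0.$$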


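By Lemma A.11, we may find a sequence $0 < \varepsilon_k \downarrow 0$ such that $x \pm \varepsilon_k$ are continuity points of $F_X$, for all $k$. Replacing $\varepsilon$ by $\varepsilon_k$ gives

$$F_X(x - \varepsilon_k) \le \lim_{n\to\infty} P[X_n + Y_n \le x] \le F_X(x + \varepsilon_k).$$

Since $x$ is a continuity point of $F_X$, letting $k \to \infty$ establishes $X_n + Y_n \xrightarrow{d} X$.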
To prove the second part, let $Z_n = Y_n - c$, so that $Z_n \xrightarrow{p} 0$. Thus, if we can show $X_n Z_n \xrightarrow{d} 0$, then the conclusion follows by the first part of the theorem, which is already proven. Let $\varepsilon > 0$ and let $M_k \uparrow \infty$ be a positive sequence such that $\varepsilon M_k$ is a continuity point of $F_{|X|}$, for all $k$ (this choice is feasible by Lemma A.11). Note also that $|X_n| \xrightarrow{d} |X|$ by the continuous mapping theorem (Theorem 2.25, p. 57). Combining these ingredients yields:

$$\begin{aligned}
P[|X_n Z_n| > \varepsilon] &\le P[|X_n Z_n| > \varepsilon, |Z_n| < 1/M_k] + P[|Z_n| \ge 1/M_k] \\
&\le P[|X_n| > \varepsilon M_k] + P[|Z_n| \ge 1/M_k] \\
&\le 1 - P[|X_n| \le \varepsilon M_k] + P[|Z_n| \ge 1/M_k] \\
\implies \lim_{n\to\infty} P[|X_n Z_n| > \varepsilon] &\le P[|X| > \varepsilon M_k].
\end{aligned}$$


The right-hand side can be made arbitrarily small by choosing $k$ sufficiently large. Thus $Z_n X_n \xrightarrow{p} 0$. Since $X_n Y_n = Z_n X_n + c X_n$, we use the first part of the theorem (already proven) to conclude that $X_n Y_n \xrightarrow{d} cX$. □
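The sandwich inequality used in the first part of the proof can be evaluated exactly in a simple Gaussian model. The sketch below assumes, purely for illustration, that $X_n \sim N(0,1)$ and $Y_n \sim N(0, 1/n)$ are independent (so that $c = 0$ and $X_n + Y_n \sim N(0, 1 + 1/n)$). None of these modelling choices come from the text, and the function names are ours.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sandwich(x, eps, n):
    """Return (lower, middle, upper) of the sandwich inequality in the proof.

    Assumed model: X_n ~ N(0, 1) and Y_n ~ N(0, 1/n), independent, so c = 0.
    """
    tail = 2.0 * (1.0 - norm_cdf(eps * math.sqrt(n)))   # P[|Y_n| > eps]
    middle = norm_cdf(x / math.sqrt(1.0 + 1.0 / n))     # P[X_n + Y_n <= x]
    lower = norm_cdf(x - eps) - tail                    # P[X_n <= x - eps] - tail
    upper = norm_cdf(x + eps) + tail                    # P[X_n <= x + eps] + tail
    return lower, middle, upper
```

For fixed `eps`, the tail term vanishes as `n` grows, so the outer bounds approach $F_X(x - \varepsilon)$ and $F_X(x + \varepsilon)$ while the middle term approaches $F_X(x)$.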


A.8  On the Proof of the Central Limit Theorem


The standard proof of the central limit theorem makes use of the characteristic function, and thus involves notions from complex analysis, and more specifically the Lévy continuity theorem (see, e.g., Billingsley [2], Sect. 29). Since the latter result is beyond the scope of this text, we will provide an elementary proof here due to Lindeberg [17] (as presented in Dalang [7]), that is based on stronger assumptions, namely existence of a third absolute moment.⁵,⁶

We first need three intermediate results. In what follows, $C_b^3(\mathbb{R})$ denotes the set of all bounded, thrice continuously differentiable functions $\mathbb{R} \to \mathbb{R}$ whose first three derivatives are also bounded.


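Lemma A.13 Let $Z$ be a continuous random variable, and $\{Z_n\}_{n \ge 1}$ a sequence of random variables such that

$$E[g(Z_n)] \xrightarrow{n\to\infty} E[g(Z)]$$

for all $g \in C_b^3(\mathbb{R})$. Then

$$F_{Z_n}(x) \xrightarrow{n\to\infty} F_Z(x), \qquad \forall x \in \mathbb{R}.$$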


⁵ As a matter of fact, even this weaker version of the theorem would suffice for the asymptotic results presented in this text: these require the sufficient statistic of an exponential family to satisfy the central limit theorem (as in Corollary 2.24, p. 56), and the latter statistic will have finite moments of all orders (see Eq. (2.11), p. 51, in the proof of Proposition 2.11).

⁶ The same method of proof can be "upgraded" to work under only second moment assumptions, assuming knowledge of measure theory, in particular the monotone convergence theorem (Dalang [7]).
Proof Let $x \in \mathbb{R}$ and $k \ge 1$ be given. Note that we may always choose a function $g_k \in C_b^3(\mathbb{R})$ that satisfies the envelope relation

$$\mathbf{1}\{z \in (-\infty, x]\} \le g_k(z) \le \mathbf{1}\{z \in (-\infty, x + 1/k]\}. \tag{A.4}$$

Then, for all $n \ge 1$,

$$F_{Z_n}(x) = P[Z_n \le x] = E[\mathbf{1}\{Z_n \in (-\infty, x]\}] \le E[g_k(Z_n)],$$

and hence by our assumption we have

$$\limsup_{n\to\infty} F_{Z_n}(x) \le \lim_{n\to\infty} E[g_k(Z_n)] = E[g_k(Z)] \le E[\mathbf{1}\{Z \in (-\infty, x + 1/k]\}] = F_Z(x + 1/k).$$

The same type of argument shows that $\liminf_{n\to\infty} F_{Z_n}(x) \ge F_Z(x - 1/k)$. Since the choice of $k$ was arbitrary, and since $F_Z$ is everywhere continuous, we have that $F_{Z_n}(x) \xrightarrow{n\to\infty} F_Z(x)$, completing the proof. □
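The proof asserts that a function $g_k$ satisfying (A.4) can always be chosen. One possible construction, offered here only as a sketch and not necessarily the one the author has in mind, glues the constants 1 and 0 together with a degree-7 polynomial step $S(t) = 35t^4 - 84t^5 + 70t^6 - 20t^7$. Its derivative is $S'(t) = 140\,t^3(1-t)^3$, so the first three derivatives vanish at $t = 0$ and $t = 1$ and the glued function is bounded with three bounded continuous derivatives. The names `smooth_step` and `g_k` are ours.

```python
def smooth_step(t):
    """Degree-7 polynomial step: 0 for t <= 0, 1 for t >= 1, monotone in between.

    S'(t) = 140 t^3 (1 - t)^3, so S', S'' and S''' vanish at t = 0 and t = 1.
    """
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return t**4 * (35.0 - 84.0 * t + 70.0 * t**2 - 20.0 * t**3)

def g_k(z, x, k):
    """Envelope function: 1 on (-inf, x], 0 on [x + 1/k, inf), smooth in between."""
    return 1.0 - smooth_step(k * (z - x))
```

By construction `g_k(z, x, k)` lies between the two indicator functions in (A.4) for every `z`.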


Lemma A.14 Let $g \in C_b^3(\mathbb{R})$, and let $\sup_{x \in \mathbb{R}} |g'''(x)| = C < \infty$. Let $(Y, Z)$ be independent random variables such that $E[Y] = E[Z]$, and $E[Y^2] = E[Z^2]$. If $X$ is independent of $Y$ and $Z$, we have

$$\big| E[g(X + Y) - g(X + Z)] \big| \le \frac{C}{6} \left( E|Y|^3 + E|Z|^3 \right).$$


Proof Taylor's theorem (Theorem A.1, p. 159) yields that

$$g(x + y) = g(x) + y\,g'(x) + \frac{1}{2} y^2 g''(x) + \frac{1}{6} y^3 g'''(u),$$

where $u$ lies between $x$ and $x + y$. It follows now by independence that

$$E[g(X + Y)] = E[g(X)] + E[Y]E[g'(X)] + \frac{1}{2} E[Y^2]E[g''(X)] + \frac{1}{6} E[Y^3 g'''(U)]$$
$$E[g(X + Z)] = E[g(X)] + E[Z]E[g'(X)] + \frac{1}{2} E[Z^2]E[g''(X)] + \frac{1}{6} E[Z^3 g'''(V)]$$

for a random variable $U$ that lies between $X$ and $X + Y$ almost surely, and a random variable $V$ that lies between $X$ and $X + Z$ almost surely. Consequently, our assumptions yield that

$$\begin{aligned}
\big| E[g(X + Y) - g(X + Z)] \big| &= \left| \frac{1}{6} E[Y^3 g'''(U)] - \frac{1}{6} E[Z^3 g'''(V)] \right| \\
&\le \frac{1}{6} E\big|Y^3 g'''(U)\big| + \frac{1}{6} E\big|Z^3 g'''(V)\big| \\
&\le \frac{C}{6} \left( E|Y|^3 + E|Z|^3 \right).
\end{aligned}$$
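The bound of Lemma A.14 can be checked numerically on a small discrete example. The sketch below rests on assumptions that are not in the text: $X$ is taken to be a constant $x_0$ (trivially independent of everything), $Y$ is a Rademacher variable, $Z$ is a three-point variable on $\{-\sqrt{3}, 0, \sqrt{3}\}$ with the same mean and variance, and $g = \sin$, for which $C = \sup |g'''| = 1$. All names are ours.

```python
import math

def expect(g, dist, shift=0.0):
    """E[g(shift + W)] for a finite distribution given as [(value, prob), ...]."""
    return sum(p * g(shift + w) for w, p in dist)

# Hypothetical example laws: both have mean 0 and variance 1.
Y = [(-1.0, 0.5), (1.0, 0.5)]                              # Rademacher
r = math.sqrt(3.0)
Z = [(-r, 1.0 / 6.0), (0.0, 2.0 / 3.0), (r, 1.0 / 6.0)]    # three-point law

def lemma_gap_and_bound(x0, g=math.sin, C=1.0):
    """Return (|E g(x0 + Y) - E g(x0 + Z)|, C/6 * (E|Y|^3 + E|Z|^3)) for X = x0."""
    gap = abs(expect(g, Y, x0) - expect(g, Z, x0))
    third = expect(lambda w: abs(w) ** 3, Y) + expect(lambda w: abs(w) ** 3, Z)
    return gap, C / 6.0 * third
```

In this example the bound equals $(1 + \sqrt{3})/6$ regardless of $x_0$, and the left-hand side stays below it, which is consistent with the lemma.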


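Lemma A.15 Let $\{Y_i\}_{i \ge 1}$ be a sequence of iid random variables such that $E|Y_1|^3 < \infty$, $E[Y_1^2] = 1$, and $E[Y_1] = 0$. If $g \in C_b^3(\mathbb{R})$, then it holds that

$$E\left[ g\left( \frac{Y_1 + \dots + Y_n}{\sqrt{n}} \right) \right] \xrightarrow{n\to\infty} E[g(Z)],$$

where $Z \sim N(0, 1)$.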


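Proof Let $g \in C_b^3(\mathbb{R})$, and $n \ge 1$. Let $\{Z_i\}_{i=1}^n \overset{iid}{\sim} N(0, 1)$ (independent of the $\{Y_i\}$) and define

$$\tilde{Y}_i = Y_i / \sqrt{n} \quad \& \quad \tilde{Z}_i = Z_i / \sqrt{n}.$$

Since $\{\tilde{Z}_i\}_{i=1}^n \overset{iid}{\sim} N(0, 1/n)$, it follows that $\sum_{i=1}^n \tilde{Z}_i \sim N(0, 1)$ (by Corollary 1.35, p. 25). It thus suffices to show that

$$\big| E[g(\tilde{Y}_1 + \dots + \tilde{Y}_n)] - E[g(\tilde{Z}_1 + \dots + \tilde{Z}_n)] \big| \le \frac{C}{6} \, \frac{E[|Y_1|^3] + E[|Z_1|^3]}{\sqrt{n}} \tag{A.5}$$

for $C = \sup_{x \in \mathbb{R}} |g'''(x)| < \infty$. Define

$$U_i = \tilde{Y}_1 + \dots + \tilde{Y}_i + \tilde{Z}_{i+1} + \dots + \tilde{Z}_n$$
$$V_i = \tilde{Y}_1 + \dots + \tilde{Y}_{i-1} + 0 + \tilde{Z}_{i+1} + \dots + \tilde{Z}_n$$

and observe that these satisfy

$$U_i = V_i + \tilde{Y}_i \quad \& \quad U_{i-1} = V_i + \tilde{Z}_i$$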
so that we may re-write the left-hand side of Eq. (A.5) as

$$\big| E[g(U_n)] - E[g(U_0)] \big| = \left| \sum_{i=1}^{n} \big( E[g(U_i)] - E[g(U_{i-1})] \big) \right| = \left| \sum_{i=1}^{n} \big( E[g(V_i + \tilde{Y}_i)] - E[g(V_i + \tilde{Z}_i)] \big) \right|.$$


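We now use Lemma A.14 to bound the last expression by

$$\sum_{i=1}^{n} \frac{C}{6} \left( E|\tilde{Y}_i|^3 + E|\tilde{Z}_i|^3 \right) = \frac{C}{6} \sum_{i=1}^{n} n^{-3/2} \left( E|Y_i|^3 + E|Z_i|^3 \right) = \frac{C}{6} \, \frac{E|Y_1|^3 + E|Z_1|^3}{\sqrt{n}},$$

thus establishing the validity of inequality (A.5), and completing the proof. □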
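Inequality (A.5) can be compared with exact values in one case where both expectations have closed forms. The sketch below assumes, for illustration only, that the $Y_i$ are Rademacher variables and that $g = \cos$ (so $g''' = \sin$ and $C = 1$). Then $E[\cos(t S_n)] = \cos(t)^n$ for $S_n = Y_1 + \dots + Y_n$, $E[\cos(Z)] = e^{-1/2}$, $E|Y_1|^3 = 1$ and $E|Z_1|^3 = 2\sqrt{2/\pi}$. The function names are ours.

```python
import math

def lindeberg_gap(n):
    """|E cos(S_n / sqrt(n)) - E cos(Z)| for S_n a sum of n Rademacher variables."""
    lhs = math.cos(1.0 / math.sqrt(n)) ** n      # E cos(t S_n) = cos(t)^n
    rhs = math.exp(-0.5)                         # E cos(Z) for Z ~ N(0, 1)
    return abs(lhs - rhs)

def lindeberg_bound(n, C=1.0):
    """Right-hand side of (A.5): C/6 * (E|Y_1|^3 + E|Z_1|^3) / sqrt(n)."""
    third_y = 1.0                                # E|Y_1|^3 for a Rademacher variable
    third_z = 2.0 * math.sqrt(2.0 / math.pi)     # E|Z_1|^3 for N(0, 1)
    return C / 6.0 * (third_y + third_z) / math.sqrt(n)
```

In this example the actual gap decays faster than the bound, which is to be expected: (A.5) is a worst-case estimate over all admissible laws and test functions.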


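Theorem A.16 (Third Moment Central Limit Theorem) Let $Y_1, \dots, Y_n$ be iid random variables such that $E[Y_i] = \mu < \infty$, $\mathrm{Var}[Y_i] = \sigma^2$, and $E|Y_i|^3 < \infty$. Let $\bar{Y}_n = \frac{1}{n} \sum_{i=1}^{n} Y_i$. Then,

$$\sqrt{n}(\bar{Y}_n - \mu) \xrightarrow{d} N(0, \sigma^2).$$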


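Proof The random variables $\tilde{Y}_i = (Y_i - \mu)/\sigma$ satisfy the conditions of Lemma A.15. Thus, if we define

$$Z_n = \frac{\tilde{Y}_1 + \dots + \tilde{Y}_n}{\sqrt{n}} = \frac{\sqrt{n}(\bar{Y}_n - \mu)}{\sigma},$$

we must have

$$E[g(Z_n)] \xrightarrow{n\to\infty} E[g(Z)], \qquad \forall g \in C_b^3(\mathbb{R}),$$

for $Z \sim N(0, 1)$. Lemma A.13 now implies that $F_{Z_n}(x) \xrightarrow{n\to\infty} F_Z(x)$ for all $x \in \mathbb{R}$, and so $\sigma Z_n = \sqrt{n}(\bar{Y}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$. □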
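The conclusion of the theorem can be examined exactly for Bernoulli samples, where the distribution of $\sqrt{n}(\bar{Y}_n - \mu)$ is a rescaled binomial. The sketch below is only an illustration under that assumption: the choice of Bernoulli($p$) data and the function names are ours, and the exact summation is intended for moderate $n$ (up to a few hundred) to stay within floating-point range.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binom_cdf(m, n, p):
    """P[Bin(n, p) <= m], summed exactly from the probability mass function."""
    if m < 0:
        return 0.0
    m = min(m, n)
    return sum(math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(m + 1))

def clt_gap(x, n, p=0.5):
    """|P[sqrt(n)(Ybar_n - mu) <= x] - Phi(x / sigma)| for Bernoulli(p) samples."""
    sigma = math.sqrt(p * (1.0 - p))
    # sqrt(n) (S_n / n - p) <= x  is equivalent to  S_n <= n p + x sqrt(n)
    m = math.floor(n * p + x * math.sqrt(n))
    return abs(binom_cdf(m, n, p) - norm_cdf(x / sigma))
```

Calling `clt_gap(x, n)` for a fixed `x` and growing `n` shows the exact distribution function approaching the $N(0, \sigma^2)$ limit, up to the lattice effects inherent to a discrete sample.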